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Abstract 


This  research  is  an  attempt  to  reason  about  the  control  of  parallel  computation  in  the 
world  of  applicative  programming  languages. 

Applicative  languages,  in  which  computation  is  performed  through  function  applica¬ 
tion  and  in  which  functions  are  treated  as  first-class  objects,  have  the  benefits  of  elegance, 
expressiveness  and  having  clean  semantics.  Parallel  computation  and  real-world  concur¬ 
rent  activities  are  much  harder  to  reason  about  than  the  sequential  counterparts.  Many 
parallel  applicative  languages  have  thus  hidden  most  control  details  with  their  declarative 
programming  styles,  but  they  are  not  expressive  enough  to  characterize  many  real  world 
concurrent  activities  that  can  be  easily  explained  with  concepts  such  as  message  passing, 
pipelining  and  so  on. 

Ease  of  programming  should  not  come  at  the  expense  of  expressiveness.  Therefore, 
we  design  a  parallel  applicative  language  Pscheme  such  that  programmers  can  express 
explicitly  the  control  of  parallel  computation  while  maintaining  the  clean  semantics  and 
the  ease  of  programming  of  applicative  languages.  In  Pscheme,  we  propose  the  concept  of 
ports  to  model  the  general  control  in  parallel  computation.  Through  program  examples, 
we  show  how  Pscheme  and  ports  support  various  parallel  programming  paradigms.  We 
have  also  built  libraries  for  higher  level  control  facilities  with  ports  so  that  programming 
in  Pscheme  becomes  easier. 

We  provide  an  operational  semantics  for  Pscheme,  and  develop  a  compiler  and  a  run 
time  system  on  NYU’s  Ultracomputer.  Our  experiments  with  parallel  programs  have 
shown  satisfactory  speedup.  We  claim  that  ports  are  the  natural  parallel  extensions  of 
continuations  in  sequential  computation,  and  thus  conclude  that  representing  general 
control  in  parallel  applicative  programming  is  feasible. 
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Chapter  1 


Introduction 


This  research  is  an  attempt  to  reason  about  parallel  computation  in  the  world  of  ap¬ 
plicative  programming  by  exposing  control  details  in  the  language  level  to  obtain  more 
expressive  power  in  a  language. 

In  the  dissertation,  we  propose  a  parallel  applicative  language  Pscheme  to  illustrate 
our  idea.  We  provide  a  formal  definition  of  Pscheme,  describe  its  implementation,  and 
perform  experiments  with  various  examples. 

1.1  Motivation 

Applicative  languages,  in  which  computation  is  performed  through  function  application 
and  in  which  functions  are  treated  as  first-class  objects,  have  the  benefits  of  elegance, 
expressiveness  and  having  clean  semantics. 

With  the  increasing  need  of  parallel  computation,  there  have  been  proposals  for  vari¬ 
ous  parallel  applicative  languages.  Since  parallel  computation  and  real-world  concurrent 
activities  are  much  harder  to  reason  about  than  the  sequential  counterparts,  many  par¬ 
allel  languages  have  hidden  most  control  details  and  support  a  particular  programming 
paradigm  which  represents  a  certain  aspect  of  parallel  computation.  In  the  case  of  ap- 


plicative  languages,  the  aspect  that  most  parallel  dialects  illustrate  is  to  apply  functions 
in  parallel. 

Clearly,  not  all  parallel  behaviors  can  be  modeled  by  function  application,  but  the 
declarative  style  of  applicative  programming  hides  all  other  control  details.  Ease  of  pro¬ 
gramming  should  not  come  at  the  expense  of  expressiveness.  Therefore,  the  goal  of  this 
research  is  to  design  a  parallel  applicative  language  such  that  programmers  can  express 
explicitly  the  control  of  parallel  computation  while  maintaining  the  clean  semantics  and 
the  ease  of  programming  of  applicative  languages. 

Specifically,  we  are  proposing  a  language  that 

•  supports  more  expressive  parallel  applicative  programming, 

•  provides  a  general  control  mechanism  in  parallel  computation, 

•  is  compatible  with  different  parallel  programming  paradigms,  and 

•  can  characterize  various  parallel  behaviors  in  the  real  world. 

We  also  believe  that  parallel  computing  means  more  than  high  performance  comput¬ 
ing.  From  a  language  designer’s  view  point,  a  language  should  executes  well  on  various 
computer  architectures,  so  should  it  model  well  on  various  issues  and  activities.  Through 
our  proposed  language,  we  hope  to  answer  the  following  questions: 

•  How  expressive  can  a  parallel  language  be? 

•  How  hard  is  it  for  a  programmer  to  reason  about  or  model  parallel  computation? 

•  How  efficient  can  it  be  for  a  parahel  language  to  support  multiple  programming 
paradigms? 
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1.2  Approach  and  Background 


To  provide  a  general  control  mechanism  in  a  parallel  language,  it  is  not  enough  to  have 
just  a  set  of  process  control  primitives.  It  would  be  better  to  have  an  orthogonal  construct 
that  can  represent  a  particular  control  point  during  the  computation.  This  approach  is 
adopted  by  many  sequential  languages. 

In  the  world  of  lamb  da- calculus  based  languages,  each  expression  has  a  return  value, 
and  the  returning  of  this  value  corresponds  to  a  state  in  the  course  of  the  whole  com¬ 
putation  based  on  a  particular  evaluation  order.  This  state  can  be  made  available  to 
programmers  as  a  language-level  construct  so  as  to  model  control  in  many  applicative 
languages,  and  is  called  a  continuation.  Since  we  are  focusing  on  parallel  applicative  pro¬ 
gramming,  we  would  like  to  model  as  well  the  state  that  an  expression  represents  during 
the  parallel  execution  of  a  program  This  motivates  our  proposing  the  concept  of  ports , 
which  we  will  explain  in  following  chapters. 

1.2.1  Continuations 

The  concept  of  continuations  traces  back  to  denotational  models  of  the  call-by- value 
lambda  calculus  [53,54].  Traditional  semantics  of  a  lambda  expression  is  based  on  a 
meaning  function  of  a  domain  expression  and  an  environment.  The  meaning  function 
gives  a  denotation  of  the  expression  in  the  environment.  An  alternative  semantic  model 
is  the  continuation  model,  in  which,  the  meaning  function  takes  one  more  argument  -  a 
continuation  function,  and  the  denotation  of  the  expression  in  the  environment  is  obtained 
by  applying  the  continuation  function  with  the  result  of  interpreting  the  expression  in  the 
environment.  This  is  called  the  continuation  passing  style  (CPS)  semantics. 

The  CPS  semantics  makes  the  expression  evaluation  order  explicit  by  specifying  each 
sub-expression’s  meaning  and  its  continuation  in  turn  so  as  to  get  the  meaning  of  the  whole 
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expression.  This  denotational  model  helps  provide  language-level  continuations  because 
it  becomes  easier  to  develop  a  CPS  denotational  semantics  for  a  language  providing  first- 
class  continuations,  and  it  is  also  easier  to  implement  it. 

In  most  applicative  languages,  continuations  appear  in  the  form  of  one-argument  func¬ 
tions.  At  the  interpretation  meta-level,  however,  applying  such  a  one-argument  function 
causes  computation  to  go  on  towards  the  completion  of  the  whole  evaluation.  In  other 
words,  application  of  continuation  function  never  returns  even  if  it  happens  within  an  ex¬ 
pression  block.  In  fact,  it  serves  as  a  control  point  in  the  course  of  the  whole  computation. 
Such  a  control  point,  when  provided  in  the  language  level,  becomes  a  very  useful  tool  to 
programmers. 

Therefore,  with  first-class  continuations  provided,  programs  become  more  expressive 
because  any  control  point  can  be  captured  in  the  form  of  a  continuation.  Various  control 
facilities  such  as  non-local  escape,  J-operator  [48],  label  values  [56],  coroutines  [32],  and 
engines  [14]  can  hence  be  implemented. 

It  is  known  that  continuations  provide  a  general  control  mechanism  in  applicative 
languages  with  acceptable  performances  (close  to  high  order  languages  without  contin¬ 
uations).  However,  continuations  also  decrease  the  readability  of  programs  written  in 
applicative  languages.  Solving  the  trade-off  between  readability  and  expressiveness  relies 
on  building  higher  level  control  operators  so  that  the  language  has  enough  expressive 
power  while  it  is  still  easy  to  program  with. 

1.2.2  Extensions  of  Continuations 

In  a  parallel  applicative  language,  it  is  natural  to  represent  control  with  an  extension  of 
continuations.  There  have  been  attempts  to  propose  parallel  continuations  [33].  Since 
continuations  represent  control  points  in  sequential  threads,  most  proposed  parallel  con- 
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tinuations  represent  the  rest  of  the  computation  for  a  particular  process.  Invoking  a 
parallel  continuation  still  behaves  like  a  transfer  of  control,  but  the  concept  of  the  rest  of 
computation  does  not  apply  across  processes. 

Our  approach  views  parallel  continuations  as  the  links  in  the  data-flow  diagram  rep¬ 
resenting  the  expression  being  evaluated  in  parallel.  For  example,  in  a  Scheme  expression 
(+  el  e2),  el  and  e2  may  be  evaluated  in  parallel  before  +  is  applied.  The  two  links 
in  the  data-flow  diagram  shown  below  can  be  reified  as  language-level  constructs  which 
represents  ports  to  the  +  operator.  The  +  operation  will  be  triggered  when  data  elements 
flow  to  both  ports. 


/  \ 

el  e2 

In  any  given  expression  of  an  applicative  language,  there  is  a  corresponding  data¬ 
flow  diagram.  If  those  links  in  the  data-flow  diagram  become  first-class  language-level 
constructs,  it  is  then  possible  to  make  any  link  accessible  to  any  expression  node,  and 
hence  enables  us  to  provide  a  general  control  mechanism  with  them. 

Such  intuition  motivates  our  designing  the  language  Pscheme,  a  parallel  dialect  of 
Scheme,  and  its  primary  new  construct,  the  port ,  a  concrete  representation  for  those 
conceptual  links.  Ports  distinguish  continuations  conceptually  in  the  following  aspects: 

1.  A  port  no  longer  represents  the  rest  of  the  computation.  In  the  above  example,  each 
port  only  represents  some  information  about  the  rest  of  computation. 

2.  A  port  is  no  longer  a  procedural  abstraction  or  a  label  for  control  to  be  transferred 
to.  Data  elements  passed  to  a  port  sometimes  have  to  synchronize  with  those  in 
other  ports  before  following  computation  starts. 
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Formal  and  detailed  definitions  regarding  ports  are  provided  in  subsequent  chapters. 
In  our  approach  to  model  control  in  parallel  applicative  programming,  we  believe  that 
ports  are  the  natural  parallel  extension  of  continuations  not  only  in  the  language  level, 
but  also  in  the  semantics  and  implementation  aspects. 

Ports  are  useful  to  explain  the  semantics  of  multi-threaded  applicative  languages.  As 
continuations  are  used  in  compilers  for  applicative  languages  (the  intermediate  represen¬ 
tation  is  in  continuation  passing  style),  ports  can  be  also  be  used  to  implement  synchro¬ 
nization  among  function  arguments  evaluated  in  parallel.  We  will  give  further  details  in 
chapters  regarding  formal  semantics  and  the  Pscheme  compiler. 

1.3  Dissertation  Outline 

The  dissertation  presents  the  work  to  represent  general  control  in  parallel  applicative 
programming  through  the  language  Pscheme.  In  chapter  2,  we  define  Pscheme,  describe 
its  features,  and  introduce  the  first-class  language  construct  port. 

Chapter  3  discusses  programming  in  Pscheme  and  presents  various  programming 
styles.  It  demonstrates  how  ports  can  be  used  to  represent  different  control  facilities. 

Chapter  4  gives  a  formal  definition  of  Pscheme.  We  have  an  operational  semantics 
which  interprets  Pscheme  expressions  in  terms  of  transition  rules  on  our  abstract  reduction 
machine. 

The  implementation  work  is  shown  in  chapter  5  and  6.  Chapter  5  describes  our 
Pscheme-to-C  translator,  and  chapter  6  discusses  the  run  time  system  and  our  execution 
model. 

We  present  our  performance  results  in  chapter  7,  and  discuss  issues  of  scalability, 
granularity,  slow  down  factors  and  so  on. 

Chapter  8  discusses  related  works  in  other  languages  including:  Multilisp,  Actors, 
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Lucid  and  Qlisp,  and  other  parallel  extensions  of  continuations.  We  will  also  make  com¬ 
parisons  between  Pscheme  and  those  languages. 

Chapter  9  describes  our  future  work  and  concludes  the  dissertation. 
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Chapter  2 

Pscheme  —  A  Parallel  Applicative 
Language  Providing  a  General 
Control  Mechanism 


In  this  chapter,  we  describe  Pscheme,  a  parallel  dialect  of  Scheme  [091].  The  primary 
construct  for  specifying  parallelism,  synchronization,  and  communication  is  a  natural  ex¬ 
tension  of  first-class  continuations  which  we  call  a  port.  The  benefit  of  using  Pscheme  is 
that  the  user  has  complete  control  over  the  order  of  parallel  evaluation,  while  at  the  same 
time  benefiting  from  the  use  of  a  very  high  level  language  like  Scheme.  Other  parallel  vari¬ 
ants  of  LISP  and  Scheme  have  been  proposed,  based  on  first-class  continuations,  but  we 
feel  that  our  ports  provide  the  most  natural  parallel  extension  of  sequential  continuations 
of  the  methods  suggested. 

Like  first-class  continuations,  ports  are  very  powerful  constructs  that  can  lead  to 
complicated  programs  that  are  difficult  to  read.  We  envision  ports  (and  continuations)  to 
be  primarily  the  tool  of  advanced  system  programmers  who  provide  libraries  of  higher  level 
parallel  constructs  (futures,  barriers,  engines,  etc.)  for  use  by  the  general  programming 


community.  We  show  how  ports  are  sufficiently  powerful  to  easily  implement  such  high 
level  parallel  constructs  in  chapter  3. 


We  present  the  language  Pscheme  in  this  chapter  through  our  design  motivation  and 
short  program  examples.  Formal  semantics  will  be  given  later  in  chapter  4. 

2.1  Main  Features 

Pscheme  is  a  parallel  extension  of  the  Lisp  dialect  Scheme.  It  can  be  viewed  as  the 
following  formula  shows: 

Pscheme  =  Scheme  +  Parallel  Control  +  First  Class  Ports 

The  main  features  of  Pscheme  are: 

1.  As  in  Scheme,  Pscheme  is  lexically  scoped,  has  call  by  value  semantics,  and  supports 
first-class  functions. 

2.  It  is  a  multi-threaded  language  with  a  global  name  space. 

3.  It  provides  a  first-class  construct  named  port,  which  represents  a  particular  control 
point  and  its  related  environmental  information  in  the  course  of  computation. 

4.  It  provides  an  ordering  mechanism  through  ports  so  that  order  preservation  of  data 
elements,  messages  or  thread  execution  is  all  possible. 

5.  It  provides  exclusively  accessed  functions  (similar  to  Hoare’s  monitor  [35])  so  that 
there  can  only  be  one  activation  of  an  exclusive  function  at  any  time  during  the 
execution.  These  functions  are  used  to  control  the  thread  ordering  as  well. 

6.  It  supports  horizontal  parallelism ,  in  which  the  arguments  in  a  function  call  are 
evaluated  in  parallel,  as  well  as  vertical  parallelism ,  which  is  parallelism  between  a 
producer  (e.g.  of  stream  of  values)  and  a  consumer. 
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As  the  above  formula  shows,  Scheme  is  a  subset  of  Pscheme.  A  program  that  does 
not  contain  any  parallel  construct  has  exactly  the  same  semantics  as  in  Scheme.  In  other 
words,  any  sequential  thread  in  Pscheme  is  actually  the  execution  of  a  Scheme  expression. 

2.2  Definitions  of  Pscheme  Parallel  Constructs 

Pscheme  contains  relatively  few  constructs  for  specifying  parallel  program  behavior.  These 
constructs,  primarily  pcall,  call/mp,  call/sp,  throw,  die,  and  exclusive  are  simple 
to  describe  and  use,  but  are  very  powerful  when  used  in  conjunction  with  each  other. 

2.2.1  pcall:  A  Simple  Construct  for  Expressing  Paral¬ 
lelism 

Before  discussing  ports  and  continuations,  we  introduce  pcall,  the  basic  construct  for 
expressing  parallelism  in  Pscheme.  It  is  a  simple  fork-join  construct  seen  in  many  lan¬ 
guages.  The  innovation  of  the  work  described  here  results  from  the  interaction  of  pcall 
with  ports,  as  described  later  in  this  paper.  The  pcall  construct  takes  the  form 

(pcall  eO  el  ...  en) 

and  causes  the  evaluation  of  el,  ...  ,  en  in  parallel  after  eO  is  evaluated.  When  each 

of  the  expressions  ei  has  been  evaluated  to  its  value  ty,  the  procedure  value  vq  (resulting 
from  evaluating  eO)  is  invoked  with  arguments  v\---vn.  It  is  important  to  note  that 
the  body  of  the  procedure  is  not  invoked  until  the  parallel  evaluation  of  the  arguments 
has  completed.  Thus,  pcall  serves  as  a  barrier  synchronization  for  the  evaluation  of 
the  expressions  el  ...  en.  This  synchronization  becomes  very  important  in  our  later 
examples. 

2.2.2  Ports:  Parallel  Extensions  of  Continuations 
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Figure  2.1:  A  simple  data-flow  diagram 

The  expression  (pcall  f  a  b  c)  can  be  thought  of  as  being  graphically  represented  by 
the  picture  in  figure  2.1. 

When  a,  b,  and  c  have  been  evaluated,  their  values  are  sent  along  their  corresponding 
arcs  to  f.  In  order  to  capture  the  arcs  as  first-class  values,  we  introduce  a  construct  called 
call-with-current-multiport ,  which  is  written  call/mp  (the  term  “multi-”  will  be  addressed 
later).  Like  its  sequential  analog  call/cc,  call/mp  takes  a  single  parameter  which  is  a 
function  that  takes  a  single  parameter,  the  current  port,  that  is,  the  arc. 

Thus,  the  expression 

(pcall  f  (call/mp  (lambda  (pi)  a)) 

(call/mp  (lambda  (p2)  b)) 

(call/mp  (lambda  (p3)  c))) 

binds  pi,  p2,  and  p3  to  the  arcs  as  shown  in  figure  2.1 

Looking  at  the  illustration  as  a  data-flow  graph,  it  is  clear  that  f  is  invoked  whenever 
a  trio  of  values  are  sent  to  ports  pi,  p2,  and  p3.  It  was  mentioned  above  that  one  way  to 
produce  values  along  the  arcs  is  for  the  evaluation  of  a,  b,  and  c  to  complete.  However, 
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now  that  pi,  p2,  and  p3  are  tangible  objects,  values  can  also  be  sent  down  the  arcs  by 
explicitly  throwing  the  values  to  the  respective  ports  (using  the  throw  construct  described 
below).  With  the  explicit  use  of  ports,  f  can  be  invoked  many  times,  once  each  time  all 
of  its  input  ports  have  a  value  available  (due  to  a  throw  or  a  returned  value).  Thus,  not 
only  is  parallelism  generated  by  the  use  of  pcall,  but  also  by  multiple  invocations  of  f 
occurring  in  parallel  due  to  multiple  values  being  thrown  to  each  of  f ’s  input  ports.  The 
image  of  a  single  f  being  a  consumer  of  the  values  is  no  longer  accurate.  A  new  activation 
of  f  is  created  for  each  trio  of  values  thrown  to  its  input  ports  (later  on,  we  will  introduce 
a  new  construct,  exclusive,  that  specifies  that  only  one  activation  of  f  can  execute  at 
any  given  time). 

2.2.3  Multi  vs.  Single  Ports 

It  might  also  be  desirable  for  a  given  function  to  be  invoked  only  once,  but  to  be  able  to 
invoke  it  either  by  having  its  arguments  return  values,  or  by  explicitly  passing  values  to 
its  input  ports.  That  is,  it  might  be  desirable  to  restrict  the  flow  of  data  to  each  input 
port  to  a  single  value.  After  a  value  is  sent  to  the  port,  the  port  shuts  down  and  refuses 
to  accept  any  more  values.  Subsequent  values  thrown  to  that  port  have  no  effect. 

This  special  kind  of  port  is  called  a  single-port ,  and  is  made  into  a  first-class  object 
by  the  construct  call/sp  (meaning  call-with-current-single-port).  The  usual  port,  one  to 
which  many  values  can  be  sent,  is  therefore  called  a  multi-port. 

In  general,  each  sub-expression  in  a  Pscheme  program  corresponds  to  a  port  which  is 
implicitly  multi-entrant  if  it  is  not  made  single-entrant  by  call/sp. 

2.2.4  Pscheme  Thread  Creation  and  Termination 

In  Pscheme,  threads  (processes)  are  created  by  the  pcall  construct,  as  described  above. 
In  addition,  the  throw  construct,  which  sends  a  value  to  a  port  explicitly,  sometimes  also 
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creates  a  thread.  Thread  termination  is  specified  by  the  die  construct,  which  terminates 
the  current  thread’s  computation. 

(throw  el  e2)  evaluates  expressions  el  and  e2,  resulting  in  a  port  p  and  a  value  v 
respectively,  sends  v  to  p,  and  returns  v.  If  p  is  a  closed  single-port,  then  the  throw  has 
no  effect  other  than  to  return  v.  However,  if  p  is  a  multi-port  or  an  open  single-port,  a 
new  thread  is  created  if  the  receiver  is  not  blocked  waiting  for  values  from  other  ports. 
Otherwise,  v  is  queued  within  p.  1  In  either  case,  the  throw  expression  returns  the  value 
v,  and  the  current  thread  continues.  2  throw  is  an  atomic  action  in  the  sense  that  if  a 
process  throws  one  value  after  another  to  the  same  port,  the  arrival  order  is  the  same  as 
the  sending  order. 

(die)  terminates  the  computation  of  the  current  thread  and  returns  no  value.  A 
(die)  in  a  parent  process  does  not  kill  its  child  processes.  Here  is  an  example  of  the 
typical  use  of  a  single-port:  to  commit  to  the  first  value  computed  by  competing  parallel 
processes.  For  example,  suppose  a  binary  tree  is  encoded  in  the  list  form  (key  left-child 
right-child).  To  find  an  element  with  a  key  value  in  this  tree,  a  parallel  search  along 
all  branches  can  be  performed  as  shown  figure  2.2.  The  first  value  passed  to  the  answer 
port  is  the  value  that  find  returns. 

2.2.5  Exclusive  Functions 

As  described  above,  a  new  activation  of  a  function  is  created  each  time  there  is  a  value 
available  on  each  of  its  input  ports.  Thus,  this  is  a  method  for  creating  multiple  invoca¬ 
tions  of  the  same  function  in  parallel. 

1  For  example,  a  new  thread  cannot  be  spawned  if  the  value  is  passed  to  a  port  associated  with  a  function 

argument  position,  and  it  has  to  synchronize  with  other  arguments  in  order  to  apply  the  function. 

2  Notice  that  our  definition  of  throw  is  different  from  those  in  sequential  languages  supporting  contin¬ 
uations.  Our  throw  does  not  transfer  control,  but  tries  to  spawn  a  new  thread. 
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(define  (findl  x  tr  port) 

(cond  ((null?  tr)  #f) 

((eq?  x  (car  tr))  (throw  port  #t)  (die)) 
(else  (pcall  (lambda  (al  a2)  #f) 

(findl  x  (cadr  tr)  port) 
(findl  x  (caddr  tr)  port))))) 

(define  (find  elt  tree) 

(call/sp  (lambda  (p)  (findl  elt  tree  p)))) 


Figure  2.2:  Finding  an  element  in  a  tree  in  parallel 


It  might  be  the  case,  however,  that  it  is  desirable  for  there  to  be  only  one  invocation 
of  the  function  active  at  any  given  time.  Such  a  function  is  said  to  be  exclusive  (and  has 
essentially  the  same  behavior  as  a  Hoare’s  monitor).  An  exclusive  function  is  desirable, 
for  example,  if  it  modifies  some  data  structure  that  requires  mutually  exclusive  access. 
Another  (related)  possibility  is  that  the  function  produces  output  or  results  whose  order 
is  important. 

The  Pscheme  construct  exclusive  is  used  to  create  exclusive  functions.  The  expression 

(exclusive  e) 

evaluates  e  to  some  function  value  /,  and  returns  a  function  f  with  the  same  behavior  as 
/  except  that  f  is  exclusive.  When  f  is  called,  subsequent  requests  to  call  f  are  blocked 
and  queued  until  the  current  invocation  of  f  returns  or  a  (die)  is  executed  during  the 
evaluation  of  the  current  invocation  of  f . 

Consider  the  program  in  figure  2.3  for  comparing  the  fringes  of  two  trees.  The  paral¬ 
lelism  between  the  searching  of  the  two  trees  is  desirable.  However,  an  exclusive  version 
of  compare  is  essential  to  avoid  a  race  condition  in  which  compare  reports  success  because 
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(define  (samefringe  tl  t2) 

(call/sp  (lambda(p) 

(pcall  (exclusive  compare) 

(call/mp  (lambda(pl)  (search  tl  pi)  nil)) 

(call/mp  (lambda(p2)  (search  t2  p2)  nil)))))) 

(define  (search  tree  outport) 

(cond  ((null?  tree)  nil) 

((atom?  tree)  (throw  outport  tree)) 

(else  (search  (car  tree)  outport) 

(search  (cdr  tree)  outport)))) 

(define  (compare  a  b) 

(cond  ((null?  a)  (null?  b)) 

((eq?  a  b)  (die)) 

(else  nil))) 

Figure  2.3:  Checking  if  two  trees  have  the  same  fringe  in  parallel.  In  this  program, 
nil  is  not  considered  as  a  leaf  but  a  null  child  link. 
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the  right  most  leaves  of  the  two  trees  are  identical,  even  though  some  of  the  interior  leaves 
(which  may  be  different  in  the  two  trees)  have  not  yet  been  compared.  The  call/sp  in 
samefringe  avoids  the  problem  of  success  being  reported  after  failure  has  already  been 
reported. 

(exclusive  f)  returns  a  function  f  ’  which  can  have  only  one  activation  at  any  time. 
However,  during  the  evaluation  of  f  ’ ,  exclusiveness  can  be  violated  if  pcall  is  invoked,  or  a 
multi-port  is  captured  inside  of  f  ’  and  gets  thrown  a  value.  The  language  definition  only 
enforces  exclusiveness  of  the  entry  point  of  an  exclusive  function.  The  effect  of  creating 
parallel  threads  executing  the  body  of  an  exclusive  function  is  left  unspecified  by  the 
language  definition  and  should  be  avoided. 

2.2.6  Further  Notes  about  Definitions 

We  have  introduced  all  constructs  regarding  parallelism  in  Pscheme.  All  these  primitives 
have  to  appear  in  the  forms  as  described.  None  of  them  is  a  first-class  function.  For 
example,  (pcall  throw  el  e2)  and  (exclusive  call/mp)  are  both  illegal  expressions. 
In  other  words,  pcall,  throw,  call/mp,  call/cp,  exclusive  and  die  are  all  special 
terms. 

The  concept  of  threads  needs  further  explanation.  A  thread  is  the  execution  of  a 
Pscheme  expression.  When  a  throw  happens,  say  (throw  p  v),  the  current  thread  con¬ 
tinues  by  taking  the  return  value  v  of  the  throw  expression.  A  new  thread ,  if  necessary,  is 
created  to  execute  the  context  that  the  port  p  represents.  This  is  considered  a  side  effect 
of  the  the  throw  expression.  Therefore,  in  the  body  of  an  exclusive  function  f ,  there  can 
be  a  throw  of  an  object  to  a  port  outside  of  the  body,  and  does  not  violate  the  exclusive 
property  of  f .  There  is  still  only  one  thread  executing  the  function  body. 

With  the  existence  of  multiports,  codes  become  re-entrant,  and  expressions  could 
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return  values  more  than  once.  However,  at  the  top  level,  a  program  should  give  its  answer 
only  once.  Top-level  parallelism,  where  a  program  keeps  returning  values  from  multi-port 
outputs,  is  not  allowed.  In  other  words,  any  top-level  expression  (the  expression  that 
represents  the  whole  program)  can  be  viewed  as  implicitly  associated  with  a  single-port. 
Whenever  a  program  returns  its  result,  the  whole  computation  stops  even  if  there  are  still 
running  threads.  This  is  also  guaranteed  in  our  formal  semantics. 

On  the  other  hand,  a  program  might  not  return  anything  because  a  die  blocks  all 
threads  at  the  top  level.  Such  programs  are  considered  non-terminating. 

2.3  The  Relationship  Between  Ports  and  Con¬ 
tinuations 

Although  we  introduced  call/mp  in  the  context  of  pcall,  they  are  really  orthogonal  con¬ 
structs.  We  saw  above  that  pcall  is  useful  by  itself,  and  is  actually  a  very  common 
parallel  construct.  What  is  the  effect  of  using  call/mp  without  using  pcall?  Consider 
the  expression 

(f  (call/mp  (lambda  (pi)  a)) 

(call/mp  (lambda  (p2)  b)) 

(call/mp  (lambda  (p3)  c))) 

Because  the  evaluation  of  expressions  a,  b,  and  c  occurs  sequentially,  the  illustration  used 
in  figure  2.1  would  be  misleading  for  this  example.  Assuming  a  left-to-right  evaluation 
order  among  function  arguments,  the  evaluation  order  is  clearly  seen  when  the  expression 
is  represented  in  continuation  passing  style  (CPS).  Assuming  k  is  the  continuation  for  the 
entire  expression,  the  cps-converted  expression  would  be 

(a’  (lambda  (vl)  (b’  (lambda  (v2)  (c’  (lambda  (v3)  (f’  vl  v2  v3  k))))))) 
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Figure  2.4:  A  sequential  representation  of  function  application  using  CPS 

where  a’,  b’,  c’,  and  f  ’  are  the  cps-converted  versions  of  a,  b,  c,  and  f,  respectively. 

This  might  be  illustrated  graphically  as  in  figure  2.4,  Each  continuation  is  a  procedure 
with  a  single  input  port.  When  a  value  is  sent  to  the  port,  the  continuation  is  invoked. 
This  is  exactly  the  behavior  caused  by  the  capture  of  the  continuation  using  call/cc  and 
the  invocation  of  that  continuation.  This  is  yet  another  reason  that  we  are  convinced  that 
ports  are  the  logical  extension  of  continuations  in  a  parallel  Scheme. 

In  a  sequential  computation,  such  as  the  one  pictured  above,  each  function  has  only 
one  input  multi-port.  When  a  value  is  thrown  to  that  port,  a  new  thread  computing 
the  function  call  is  created.  This  is  not  the  case  when  call/cc  is  used  in  a  sequential 
language.  Thus,  one  way  to  simulate  the  effect  of  invoking  a  continuation  by  throwing  to 
a  port  is  to  kill  off  the  throwing  process.  Specifically,  the  expression 

(call/cc  f) 

in  ordinary  (sequential)  Scheme  is  equivalent  to 

(call/mp  (lambda  (p)  (f  (lambda  (v)  (throw  p  v)  (die))))) 

in  P scheme. 

Psclieme  provides  all  features  in  Scheme  except  call/cc  and  first  class  continuations. 
When  a  Psclieme  expression  contains  none  of  the  six  parallel  constructs  described  above, 
its  semantics  is  the  same  as  Scheme’s.  That  is,  expressions  are  evaluated  in  applica¬ 
tive  order,  and  the  relative  evaluation  order  among  the  arguments  in  a  function  call  is 
unspecified. 
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As  mentioned  above,  Pscheme’s  view  of  computation  is  a  dynamic  data-flow  diagram. 
Data  streams  may  flow  between  nodes  as  thread  execution  is  ordered  through  exclusive 
functions.  In  Scheme  or  other  sequential  applicative  languages,  the  sequential  execution 
is  actually  a  depth  first  search  with  a  post  order  traversal  of  the  data-flow  diagrams  (this 
is  what  the  CPS  transform  does).  First  class  continuations  enable  us  to  jump  anywhere  in 
the  traversal.  On  the  other  hand,  in  Pscheme,  first  class  ports  allow  us  to  throw  elements 
to  any  link,  or  disconnect  a  link  (close  a  single  port),  which  gives  us  more  control  of  the 
data-flow  diagram. 
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Chapter  3 


Programming  in 


Pscheme 


In  this  chapter,  we  show  how  to  program  in  Pscheme.  Examples  will  explain  how  ports  and 
exclusive  functions  can  represent  synchronization,  blocking,  pipelining  style  parallelism 
and  ordering. 

Through  various  programming  styles,  we  have  built  higher  level  control  facilities  such 
as  semaphores,  Ada-like  rendezvous  and  Multilisp  futures.  Parallel  data  structures  such 
as  shared  atomic-accessed  variables  and  FIFO  queues  have  also  been  implemented.  From 
theses  examples,  we  can  see  that  Pscheme  gives  us  various  aspects  of  parallelism,  and 
enables  us  to  apply  different  programming  paradigms  in  one  language. 

The  chapter  is  also  an  experiment  to  see  the  feasibility  of,  and  limits  to,  representing 
control  in  a  parallel  applicative  language.  We  claim  that  ports  provide  a  control  mech¬ 
anism  as  general  as  continuations  in  sequential  languages.  Our  experience  also  suggests 
that  they  are  easily  understood  by  a  programmer  familiar  with  call/cc  and  continuation- 


based  codes. 


(define  (sum  m  n) 

(cond  ((eq?  m  n)  m) 

((eq?  n  (+  1  m))  (+  m  n)) 

(else  (let  ((x  (/  (+  m  n)  2))) 

(pcall  +  (sum  m  x)  (sum  (+  1  x)  n)))))) 


Figure  3.1:  Parallel  sum 


3.1  Divide  and  Conquer  —  Horizontal  Paral¬ 
lelism 

A  typical  case  of  parallel  execution  is  to  evaluate  argument  expressions  of  a  function  call 
in  parallel,  that  is,  to  exploit  horizontal  parallelism.  Problems  that  can  be  solved  with 
a  recursive  divide- and- conquer  algorithm  can  usually  be  parallelized  in  this  way.  For 
example,  computing  the  sum  of  all  numbers  from  1  to  n  (as  in  figure  3.1),  or  sorting  a  list 
of  numbers  with  the  quicksort  algorithm  (as  in  figure  3.2). 

Horizontal  parallelism  is  often  seen  in  other  parallel  languages.  Programs  with  hori¬ 
zontal  parallelism  are  easy  to  understand  and  are  usually  used  to  measure  the  performance 
speedup. 

3.2  Using  pcall  and  Ports  as  a  Synchronizing 
Mechanism 

Divide  and  conquer  programs,  as  shown  above,  use  pcall  to  create  parallel  threads.  But 
at  the  same  time,  pcall  also  creates  synchronous  multiple  ports  so  that  subsequent  pairs  of 
arguments  can  synchronize  and  reactivate  the  function  body  evaluation.  The  samefringe 
example  in  chapter  2  shows  such  use  of  pcall.  This  enables  us  to  build  synchronizing 
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(define  (qs  num-list) 

(if  (null?  num-list) 
nil 

(if  (null?  (cdr  num-list)) 
num-list 

(let  ((pivot  (car  num-list))) 

(let  ((left-right  (partition  pivot  (cdr  num-list)  nil  nil))) 
(pcall  append 

(qs  (car  left-right)) 

(cons  pivot  (qs  (cdr  left-right))))))))) 

(define  (partition  p  1  1-buf  r-buf) 

(if  (null?  1) 

(cons  1-buf  r-buf) 

(letrec  ((x  (car  1))) 

(if  «  x  p) 

(partition  p  (cdr  1)  (cons  x  1-buf)  r-buf) 

(partition  p  (cdr  1)  1-buf  (cons  x  r-buf)))))) 


Figure  3.2:  Quicksort 
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facilities  using  pcall. 

Ada’s  rendezvous  is  a  synchronous  communication  in  that  the  sender  and  the  receiver 
of  a  message  have  to  synchronize  before  further  execution.  The  rendezvous  is  itself  a 
function  call  with  the  arguments  passed  by  the  sender.  The  return  value  of  the  function 
call  is  actually  the  return  value  of  the  rendezvous  expression  for  both  the  sender  (the 
entry  call  expression)  and  the  receiver  (the  accept  expression). 

We  can  easily  build  an  Ada-like  rendezvous  facility  in  Pscheme.  The  expression 
(make-rendezvous  f )  returns  what  we  call  a  rendezvous  object  r  to  serve  as  an  entry  of  a 
task  in  Ada.  Tasks  communicating  through  a  rendezvous  in  Ada  are  just  Pscheme  sequen¬ 
tial  threads  in  which  the  rendezvous  object  is  visible.  A  thread  evaluating  ((r  ’send) 
argument -list)  causes  f  to  be  applied  with  argument -list.  It  synchronizes  with  the 
expression  (r  ’accept)  in  another  thread.  Both  the  send  and  the  accept  expression 
returns  the  result  of  applying  f  with  argument -list. 

Implementation  of  rendezvous  is  shown  as  in  figure  3.3.  The  idea  is  for  the  message 
sender  and  receiver  to  capture  their  current  single-ports  and  pass  them  to  a  pair  of  synchro¬ 
nizing  input  ports  to  a  pcall’ed  function,  which  evaluates  (apply  f  arguments -list) 
and  then  passes  the  result  back  to  the  sender  and  receiver  respectively. 

The  (make-rendezvous  f)  expression  does  everything.  It  defines  the  send  and 
accept  operations  in  the  binding  of  rendezvous,  and  it  creates  the  synchronizing  mech¬ 
anism  by  pcall’ing  a  function  that  takes  the  sender’s  thunk  and  receiver’s  current  port 
as  arguments.  This  function  simply  invokes  the  thunk  and  sends  the  result  back  to  the 
receiver.  What  the  thunk  does  is  to  apply  f  with  arguments  provided  by  the  sender,  and 
then  pass  the  result  back  to  sender.  The  pcall’ed  function  never  returns  because  only 
the  threads  for  the  sender  and  the  receiver  need  to  be  reactivated. 

We  also  utilize  the  pcall  mechanism  to  return  the  created  rendezvous  object  to  where 
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(define  (make-rendezvous  f) 

(let  ((portl  nil)  (port2  nil)) 

(let  ((rendezvous 

(lambda  (op) 

(if  (eq?  op  ’send) 

(lambda  (arg) 

(call/sp  (lambda(p) 

(throw  portl  (lambdaQ  (throw  p  (f  arg)))) 
(die)))) 

;else  op  =  ’accept 

(call/sp  (lambda(p)  (throw  port2  p)  (die))))))) 

(call/sp 

(lambda  (result) 

(pcall  (lambda  (sender-thunk  receiver-port) 

(throw  receiver-port  (sender-thunk)) 

(die)) 

(call/mp  (lambda(p)  (set!  portl  p)  (lambdaQ  rendezvous))) 
(call/mp  (lambda(P)  (set!  port2  p)  result)))))))) 


>  (let  ((r  (make-rendezvous  (lambda(x)  x)))) 
(pcall  cons 

(r  ’accept) 

(begin  ...  ((r  ’send)  1)  ...  2))) 

==>  (1  •  2) 


Figure  3.3:  Implementation  of  Ada- like  rendezvous 
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make-rendezvous  is  called  by  wrapping  the  rendezvous  object  in  a  thunk  and  passing  the 
thunk  to  the  current  single  port  of  the  make-rendezvous  expression. 

The  short  example  in  figure  3.3  shows  how  to  perform  synchronous  communication 
among  threads  in  Pscheme. 

3.3  Blocking  and  Locking 

In  Pscheme,  the  basic  blocking  mechanism  is  provided  by  exclusive  functions.  As  has  been 
stated  in  previous  chapters,  it  helps  control  the  ordering  of  threads  or  data  elements.  It 
also  provide  exclusive  access  to  shared  variables.  To  guarantee  mutual  exclusion,  an 
alternative  approach  is  to  use  semaphores  to  protect  critical  sections. 

In  Pscheme,  binary  semaphores  can  be  implemented.  That  is,  only  one  thread  can 
execute  the  critical  section.  Subsequent  access  requests  (that  is,  the  “P”  operations)  will 
be  queued  up  and  satisfied  later.  To  implement  binary  semaphores  in  Pscheme,  the  idea 
is  also  to  utilize  pcall,  call/sp  and  call/mp. 

In  figure  3.4,  (make-semaphore)  returns  a  semaphore  s  such  that  a  thread  evaluating 
(s  ’p)  performs  the  “P”  operation  on  s,  and  returns  the  symbol  ’dontcare.  The  call 
(s  ’v)  performs  the  “V”  operation  and  returns  ’dontcare  as  well. 

s  is  created  by  pcall’ing  a  function  f  of  two  parameters  next-process  and 
release-signal.  What  the  function  f  does  is  simply  to  invoke  the  thunk  next-process. 
Any  thread  evaluating  (s  ’p)  captures  its  current  single-port  p,  passes  p  to  entry-port 
(the  port  associated  with  next-process),  and  then  dies.  Evaluating  (s  ’v)  throws 
’dontcare  to  exit-port  (the  port  associated  with  release-signal),  and  thus  synchro¬ 
nizes  with  the  requesting  thread  (the  thread  evaluating  (s  ’p))  through  the  pcall’ed 
function  f.  s  is  initialized  by  having  the  ’dontcare  symbol  queued  in  exit-port  so  that 
the  first  thread  calling  (s  ’p)  will  not  block. 
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(define  (make-semaphore) 

(let  ((entry-port  nil)  (exit-port  nil)) 

(let  ((semaphore 

(lambda(op) 

(if  (eq?  op  ’p) 

(call/ sp 
(lambda(p) 

(throw  entry-port 

(lambdaQ  (throw  p  ’dontcare)  (die))) 

(die))) 

;else  op  =  ’v 

(throw  exit-port  ’dontcare))))) 

(pcall  (lambda  (next-process  release-signal)  (next-process)) 
(call/mp  (lambda(p)  (set!  entry-port  p) 

(lambdaQ  semaphore))) 

(call/mp  (lambda(p)  (set!  exit-port  p) 

(throw  exit-port  ’dontcare))))))) 


Figure  3.4:  Implementation  of  semaphores 
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Both  semaphores  and  exclusive  functions  provide  exclusive  access.  But,  in  our  current 
implementation,  exclusive  functions  cause  busy-waiting  blocking,  while  semaphores  cause 
back-off  blocking,  since  threads  making  access  requests  have  to  idle  themselves  before 
sending  their  current  ports  to  the  request  handler.  In  the  case  of  a  large  critical  section 
and  high  contention,  semaphores  perform  better  than  exclusive  functions.  On  the  other 
hand,  exclusive  functions  should  be  light  weight  activations  in  order  to  perform  well. 
Our  current  implementation  puts  each  created  thread  in  a  global  shared  ready  queue, 
and  several  virtual  processors  will  execute  them.  If  this  model  is  not  on  a  platform  that 
supports  fine  grain  parallelism,  the  back-off  blocking  will  take  some  turn-around  time, 
but  the  total  running  time  should  be  close  to  the  busy-waiting  model.  When  we  discuss 
our  implementation  in  chapter  6,  we  will  talk  about  this  in  more  detail. 

3.4  Shared  Mutable  Objects  and  Atomic  As¬ 
signments 

Scheme  is  not  a  purely  functional  language  because  any  variable  can  be  modified  through 
the  use  of  set!.  In  the  parallel  extension  Pscheme,  the  non-atomic  set!  on  a  shared 
mutable  variable  may  cause  nondeterministic  results.  It  is  usually  suggested  that  pro¬ 
grammers  are  responsible  to  guarantee  atomic  access  on  shared  variables  in  a  parallel 
language  (e.g.  by  using  locks  explicitly),  but  it  is  also  usually  desirable  for  a  language 
to  provide  parallel  data  structures  with  atomic  access  operations.  For  the  case  of  shared 
mutable  variables,  we  need  atomic  reads  and  writes. 

One  way  to  implement  shared  mutable  variables  in  Pscheme  is  to  use  exclusive  func¬ 
tions  to  handle  all  update  requests.  The  simplest  approach  is  to  devote  an  exclusive 
function  for  each  shared  mutable  variable  as  in  the  first  implementation  shown  in  fig¬ 
ure  3.5.  Of  course,  if  a  shared  variable  is  guaranteed  to  be  updated  only  in  a  piece  of 
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exclusive  codes,  set!  can  be  used  directly. 


The  other  way  is  to  use  the  pcall  mechanism  to  serialize  all  update  requests  as  is  done 
in  the  semaphore  implementation.  The  second  implementation  in  figure  3.5  illustrates  this 
idea.  The  current  value  is  stored  in  the  value  port,  and  the  access  requests  are  queued 
within  the  request  port.  The  request  is  handled  by  the  function  monitor.  It  either  passes 
the  current  value  v  to  the  read  requester  and  restores  v  back  to  the  current  value  port, 
or  stores  a  new  value  v'  in  the  current  value  port.  The  identifications  of  the  request  port 
and  the  current  value  port  are  not  stored  globally.  Instead,  they  are  passed  as  parameters 
each  time  an  access  request  is  issued. 


The  interesting  thing  about  doing  this  is  that  the  implementation  needs  no  set! 
or  any  assignment  operator,  and  can  be  used  to  represent  assignment  in  languages  not 
supporting  assignment.  It  demonstrates  the  fact  that  ports  are  expressive  enough  to 
describe  assignment.  However,  it  also  shows  the  fact  that  a  control  mechanism  based  on 
ports  and  threads  violates  referential  transparency  in  parallel  programs  as  is  the  case  will 
continuations  in  sequential  programs. 


As  shown  in  the  figure  3.5,  (make-cell  val)  creates  a  cell  containing  a  request  port 
and  a  value  port.  A  read  request  is  of  the  pattern  (’read  requester-current-port 
value-port) ,  and  an  assign  request  is  of  the  pattern  ( ’  assign  new-value  value-port) . 
On  a  read  request,  the  cell  passes  the  current  value  to  the  requester’s  current  port  as  well 
as  the  value  port.  On  an  assign  request,  the  cell  simply  passes  the  current  value  to  the 
value  port. 
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(define  (read  x)  (x  ’read))  ;  x  is  a  cell 

(define  (assign  x  v)  ((x  ’assign)  v)) 

;  first  implementation  of  stores  with  atomic  accesses 

(define  (cell  v) 

(exclusive 

(lambda(op) 

(case  op 
((read)  v) 

((assign)  (lambda  (new-v)  (set!  v  new-v))))))) 

; second  implementation  of  stores  with  atomic  accesses 

(define  (make-cell  val) 

(let  ((monitor 

(lambda  (pi  p2) 

(if  (list?  pi)  ;handle  access  requests 

(if  (eq?  (car  pi)  ’read)  ;pl=  request,  p2=  current  value 
(begin  (throw  (cadr  pi)  p2) 

(throw  (caddr  pi)  p2) 

(die)) 

;else  (car  pi)  =  ’assign 

(begin  (throw  (caddr  pi)  (cadr  pi)) 

(die))) 

;else  create  the  monitor  and  return 
(lambda(op) 

(case  op  ;  pi  =  request  port,  p2  =  value  port 

( (read) 

(callsp  (lambda(p)  (throw  pi  (list  ’read  p  p2)) 

(die)))) 

((assign) 

(lambda(v)  (throw  pi  (list  ’assign  v  p2))  v) ))))))) 

(pcall  monitor 

(callmp  (lambda(p)  p))  ; create  request  port 

(callmp  (lambda(p)  (throw  p  p)  val)))))  ; create  value  port  and 

; initialize  with  val 

Figure  3.5:  Two  implementations  of  stores  supporting  atomic  access 
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3.5  Shared  FIFO  Queues 


Pscheme  supports  order  preservation  of  messages,  data  elements  or  thread  execution  1 
through  first  class  ports.  The  ordering  is  guaranteed  implicitly  because  of  the  nature 
of  ports.  But,  sometimes  it  may  also  be  desirable  to  provide  explicit  ordering,  that 
is,  FIFO  queues  with  atomic  access  that  can  be  shared  by  multiple  threads.  Scheme’s 
management  of  data  structures  is  mostly  list-based,  so  are  most  applicative  languages 
providing  list  constructors  (such  as  ML  [50],  Haskell  [37],  etc.)  because  the  LIFO  feature 
fits  well  in  recursive  programming.  However,  if  Pscheme  is  to  support  other  parallel 
programming  paradigms,  lists  or  cons-cells  may  not  be  sufficient  for  a  program  that, 
for  example,  processes  a  stream  of  data  and  stores  them  in  the  order  they  arrive.  This 
motivates  our  implementation  of  shared  FIFO  queues. 

The  idea  is  similar  to  the  implementation  of  atomic  assignments,  that  is,  to  use  ex¬ 
clusive  functions  to  serialize  requests.  The  internal  structure  of  a  FIFO  queue  is  a  linked 
list  with  the  links  implemented  with  free  variable  pointers  in  a  closure.  (Pscheme  does 
not  provide  set-car!  and  set-cdr!.)  Enqueue  and  Dequeue  are  both  constant  time 
operations  containing  several  function  calls.  Since  this  is  only  a  sequential  FIFO  mecha¬ 
nism  under  the  protection  of  exclusive,  we  may  use  it  without  exclusive  as  long  as  it 
is  guaranteed  to  be  accessed  only  by  exclusive  functions. 

In  fact,  queues  are  often  used  in  stream-based  programs  as  a  local  data  structure  of  a 
particular  exclusive  function  which  processes  streams.  Later  on  in  following  sections,  we 
will  see  larger  examples  using  the  FIFO  queues  to  store  elements  in  FIFO  order. 


1For  example,  keeping  throwing  values  to  a  particular  multiport  creates  waves  of  threads  executing  the 
same  code.  These  threads  can  be  ordered  through  the  use  of  exclusive  functions. 
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(define  (new-q) 

(let  ((hd  nil)  (tl  nil)) 

(exclusive 
(lambda  (op) 

(case  op 

((deq)  (let  ((elt  (hd  ’val))) 

(set!  hd  (hd  ’next)) 
elt)) 

((enq)  (lambda  (v) 

(let  ((new  (new-q-elt  v  nil))) 

(if  (null?  hd)  ; queue  empty 

(set!  hd  (set!  tl  new))  ; initialize  with  1st  elt 
(begin 

((tl  ’set-next)  new) 

(set!  tl  new)))))) 

((empty?)  (null?  hd)))))) 

(define  (new-q-elt  val  next) 

(lambda(op) 

(case  op 

((val)  val) 

((next)  next) 

((set-next)  (lambda(v)  (set!  next  v)))))) 


(define  (dequeue  q)  (q  ’deq))  ;  return  fisrt  element 

(define  (enqueue  q  v)  ((q  ’enq)  v)  ;  return  v 

(define  (queue-empty?  q)  (q  ’empty))  ;  return  #t  or  #f 


Figure  3.6:  Implementation  of  shared  FIFO  queues 
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3.6  Futures 


Multilisp’s  future  is  a  powerful  construct  that  allows  programmers  to  specify  parallelism 
in  very  fine  granularity.  Programmers  have  sufficient  control  of  what  jobs  are  to  spawn 
for  parallel  execution  without  worrying  about  how  to  get  back  the  result  of  the  spawned 
task. 

Specifically,  in  multilisp,  (future  exp)  spawns  a  new  process  evaluating  exp,  and 
returns  a  future  object,  which,  when  passed  to  a  strict  function,  gives  the  evaluated  value 
v  of  exp.  If  the  evaluation  of  exp  is  not  complete,  the  function  waits  for  its  completion. 
The  future  object  will  always  have  the  value  v  once  the  evaluation  of  exp  is  complete 
(supposing  no  interaction  with  continuations). 

In  our  Pscheme  implementation  of  futures,  the  value  of  a  future  object  needs  to  be 
explicitly  retrieved  by  applying  the  function  future-val  to  the  future  object.  Figure  3.7 
shows  the  implementation. 

(future  exp)  initiates  a  new  thread  evaluating  exp,  and  returns  a  future  object  /, 
which  can  be  stored  or  passed  to  other  functions.  A  strict  operator  needing  exp’s  value 
v  can  call  (future-val  f),  which  returns  v  immediately  if  the  evaluation  of  exp  has 
completed,  or  blocks  until  v  is  available.  Once  the  evaluation  of  exp  is  complete,  the 
result  value  is  committed  as  the  value  of  this  future  object. 

Similarly,  a  future  object  can  be  created  by  pcall’ing  a  function  /  of  two  parame¬ 
ters  val  and  req,  representing  the  future  value  and  the  value  requester’s  current  port, 
respectively.  Before  the  future  value  v  is  available,  there  might  be  several  requests  for  it, 
so  /  performs  (throw  req  val),  and  then  throws  val  back  to  the  value  port  to  synchro¬ 
nize  with  subsequent  requests.  The  short  example  in  figure  3.7  shows  how  to  speed  up 
computation  in  call-by- value  languages. 
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(def-syntax  (future  exp) 

(make-future  (lambdaQ  (call/sp  (lambda(p)  exp))))  ;  not  re-entrant 

(define  (future-val  future-object)  (future-object)) 

(define  (make-future  thunk) 

(call/ sp 

(lambda  (return) 

(letrec 

((value  nil)  (touched?  nil)  (val-port  nil)  (req-port  nil) 

(future-obj  (lambdaQ 

(if  touched? 
value 

(call/sp  (lambda(p)  (throw  req-port  p)  (die))))))) 
(pcall  (lambda  (val  req)  (throw  req  val)  (throw  val-port  val)  (die)) 
(call/mp  (lambda  (p)  (set!  val-port  p) 

(set!  value  (thunk)) 

(set!  touched?  #t) 
value) ) 

(call/mp  (lambda  (p)  (set!  req-port  p) 

(throw  return  future-obj ) 

(die)))))))) 


>  (let  ((x  (future  (largest-prime-less-than  1000)))) 
(  ....  (future-val  x)  . . . ) 


Figure  3.7:  Implementation  of  Multilisp  futures 
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3.7  Stream-based  Programming  and  Vertical 
Parallelism 

Most  parallel  applicative  languages  exploit  parallelism  through  a  fork-join  structure  such 
as  Pscheme’s  pcall  demonstrates.  It  is  sometimes  necessary  for  a  program  to  specify 
vertical  parallelism,  a  parallelism  between  a  producer  and  a  consumer.  For  example,  we 
may  want  to  evaluate  the  function  body  and  the  function’s  argument  expression  in  parallel 
if  the  argument  value  is  not  needed  immediately  by  the  body.  Futures,  as  mentioned 
above,  can  satisfy  our  need.  In  Pscheme,  we  may  say 

((lambda(x)  ....  (future-val  x)  ...)  (future  exp)) 

to  gain  vertical  parallelism. 

Furthermore,  we  can  generalize  this  idea  to  express  pipelining  style  parallelism.  This 
is  usually  hard  to  express  in  most  parallel  applicative  languages.  Even  with  futures 
provided  as  in  Multilisp,  we  still  need  exclusive  access  facilities  to  control  the  order  of 
racing  threads,  but  exclusive  functions  are  not  supported  in  Multilisp. 

On  the  other  hand,  pipelining  is  a  kind  of  parallelism  that  is  often  seen  in  real  world 
activities.  Although  there  are  attempts  to  model  streams  in  applicative  languages  [19], 
most  are  only  sequential  simulation  of  data  streams,  and  fail  to  really  provide  pipelining 
parallelism  in  a  parallel  language.  The  reason  may  be  that  current  Von  Neumann  archi¬ 
tectures  force  the  implementation  of  pipelining  parallelism  to  be  simulated  by  light  weight 
processes,  which  is  not  considered  to  have  good  enough  performances.  We  think  that  if 
massive  fine-grained  parallelism  or  data-flow  computing  is  supported  by  the  underlying 
platform,  it  is  worth  trying  to  express  pipelining  at  the  language  level.  In  this  section,  we 
show  how  to  develop  stream-based  programs  with  pipelining-style  parallelism  in  Pscheme, 
and  demonstrate  another  dimension  of  its  expressiveness. 
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3.7.1  A  Small  Example  —  Stream  of  Fibonacci  Numbers 

The  idea  of  representing  computation  in  terms  of  streams  comes  from  the  data-flow  lan¬ 
guages  such  as  VAL  [1],  Id  [7],  Lucid  [63]  and  so  on. 

For  example,  to  calculate  Fibonacci  numbers  in  Lucid,  we  observe  that  the  infinite 
Fibonacci  stream  f  satisfies  the  recurrence  equation: 

f  =  1  ::  1  ::  (add_list  f  (cdr  f)) 

The  stream  f  can  be  generated  by  an  adding  cell  with  feedback  wires  to  its  two  inputs  as 
shown  in  Figure  3.8. 

In  P scheme,  the  “stream”  data  type  is  not  supported.  In  order  to  model  streams 
in  Psclieme,  multiports  are  used.  A  stream  is  expressed  by  throwing  data  elements  to 
a  multiport.  Figure  3.9  shows  how  to  interpret  the  conceptual  model  of  computing  the 
Fibonacci  stream  in  figure  3.8. 

The  input  arcs  are  captured  by  multi-ports  in  Psclieme,  and  the  adding  cell  is  simply 
a  function  which  also  counts  the  cycle  number  n  in  order  to  compute  up  to  fib(n).  Since 
the  two  input  ports  are  not  visible  inside  add,  we  use  two  global  variables  next-x  and 
next-y  to  store  the  captured  ports.  The  initial  values  (the  two  l’s)  are  sent  to  the  ports 
and  added  together  after  next-x  and  next-y  are  initialized  properly,  add  only  returns 
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(define  (fib  n) 

(let  ((counter  1)  (next-x  nil)  (next-y  nil)) 

(let  ((add  (lambda  (x  y) 

(if  (eq?  n  counter) 
x 

(begin  (set!  counter  (+  1  counter)) 
(throw  next-y  x) 

(throw  next-x  (+  x  y)) 
(die)))))) 

(pcall  add 

(call/mp  (lambda(p)  (set!  next-x  p)  1)) 
(call/mp  (lambda(p)  (set!  next-y  p)  1)))))) 


Figure  3.9:  Implementation  of  the  Fibonacci  stream 


the  sum  when  the  counter  reaches  n.  Otherwise,  it  dies.  Notice  that  die  does  not  have  to 
be  exclusive  because  the  barrier  synchronization  caused  by  pcall  naturally  enforces  the 
exclusiveness  of  the  application  of  add. 

The  use  of  set!  in  fib  is  awkward  and  unnecessary.  The  state  of  the  computation 
that  counter  represents  can  be  passed  as  parameters  as  are  x  and  y.  Therefore,  we  can 
have  a  more  functional  adding  cell  as  shown  in  figure  3.10. 

Streams  are  often  processed  through  multiple  stages.  In  the  following  sections,  we  will 
show  larger  examples  that  illustrate  pipelining  style  parallelism. 

3.7.2  Filters 

Basic  stream  programming  applies  a  few  processing  stages  to  an  input  stream  and  gener¬ 
ates  a  result  stream.  These  stages  work  concurrently  in  the  pipelining  model,  and  serve 
as  filters  of  their  inputs.  We  give  examples  in  this  section  of  how  to  implement  filters  and 
data  streams  to  express  pipelining  style  parallelism. 
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(define  (fib  n) 

(let  ((next-x  nil)  (next-y  nil)  (next-c  nil)) 

(let  ((add  (lambda  (x  y  counter) 

(if  (eq?  n  counter) 
x 

(begin  (throw  next-c  (+  1  counter)) 
(throw  next-y  x) 

(throw  next-x  (+  x  y)) 
(die)))))) 

(pcall  add 

(call/mp  (lambda(p)  (set!  next-x  p)  1)) 
(call/mp  (lambda(p)  (set!  next-y  p)  1)) 
(call/mp  (lambda(p)  (set!  next-c  p)  1)))))) 


Figure  3.10:  Implementing  the  Fibonacci  stream  without  using  set! 


Since  Pscheme  is  a  call-by- value  language,  a  filter  receiving  an  input  stream  will  be 
represented  as  a  one-argument  function.  Each  activation  of  the  function  is  the  process 
an  element  from  the  stream.  Filtering  a  stream  requires  reactivating  the  function  for 
each  stream  element  by  throwing  the  element  to  the  port  associated  with  the  function 
argument  position. 


The  Relationship  between  Lists  and  Streams 

The  input  stream  is  often  generated  from  data  structures  such  as  lists,  and  the  output 
stream  is  collected  and  stored  into  data  structures  as  well.  Since  the  filter  function  does 
not  know  when  there  is  no  element  coming  (it  is  decided  by  the  caller  or  thrower),  a 
special  token  is  often  passed  to  the  function  to  signal  the  end  of  the  stream. 

Figure  3.11  shows  how  lists  and  streams  of  numbers  can  be  transformed  to  each 
other  via  the  function  list->stream  and  stream->list.  (stream->list  eos)  returns 
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(define  (list->steam  1  receiving-port  eos) 

(if  (null?  1) 

(begin  (throw  receiving-port  eos)  (die)) 

(begin  (throw  receiving-port  (car  1)) 

(list->stream  (cdr  1)  receiving-port  eos)))) 

(define  (stream->list  eos) 

(let  ((buf  nil)) 

(exclusive 

(lambda(v)  (if  (eq?  v  eos) 
buf 

(begin  (set!  buf  (cons  v  buf))  (die))))))) 
Figure  3.11:  Conversion  between  lists  and  streams 


a  stream-to-list  converter,  which  is  a  one-argument  exclusive  function  that  takes  stream 
elements  as  its  argument  and  stores  them  in  its  internal  buffer.  (The  stream  ends  with 
eos.)  This  converter  finally  returns  a  list  of  elements  in  the  reverse  order. 

Examples  —  Generating  Prime  Numbers 

We  compute  the  prime  numbers  within  a  given  range  by  creating  an  exclusive  function 
which  filters  all  numbers  from  the  input  stream  that  is  a  multiple  of  previous  elements  in 
the  stream.  Any  application  program  that  works  on  primes  can  take  as  input  the  stream 
generated  by  prime-cell. 

If  we  just  want  to  collect  all  primes  less  than  or  equal  to  n,  we  may  use  stream->list 
to  return  the  prime  list  as  shown  in  figure  3.12. 

In  traditional  applicative  call- by- value  languages,  the  list  of  primes  in  some  given 
range  will  only  be  available  when  all  elements  have  been  computed.  Pscheme  lifts  the 
restriction.  The  first  prime  is  available  for  use  even  when  the  second  prime  is  still  being 
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(define  (num-stream  a  b  port) 

(if  (>  a  b) 

(begin  (throw  port  ’eos)  (die)) 

(begin  (throw  port  a)  (num-stream  (+  1  a)  b)))) 

(define  (multiple?  val  num-list)  ;  to  see  if  val  is  a  multiple  of 

(if  (null?  num-list)  ;  a  number  in  num-list 

#f 

(if  (eq?  (modulo  val  (car  num-list))  0) 

#t 

(multiple?  val  (cdr  num-list))))) 

(define  (prime-cell  out-port) 

(let  ( (prime-num-list  nil)) 

(exclusive 

(lambda(val) 

(if  (eq?  ’eos  val) 

(begin  (throw  out-port  ’eos)  (die)) 

(if  (multiple?  val  prime-num-list) 

(die) 

(begin  (throw  out-port  val) 

(set!  prime-num-list  (cons  val  prime-num-list)) 
(die)))))))) 

(define  (primes  n)  ;  return  a  list  of  primes  <=  n 

( (stream->list  ’eos) 

(call/mp  (lambda  (out-port) 

((prime-cell  out-port) 

(call/mp  (lambda  (p)  (num-stream  2  n  p) ))))))) 


Figure  3.12:  Generating  the  stream  of  prime  numbers 
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computed.  We  can  achieve  liner  grained  pipelining  by  splitting  the  testing  of  multiples 
into  stages,  each  of  which  tests  if  the  input  number  is  a  multiple  of  one  number.  If  the 
underlying  platform  supports  fine  grain  parallelism,  this  is  worth  trying. 

3.8  Applications 

We  present  in  this  section  two  larger  examples  of  stream-based  programming  -  quicksort 
and  merge  sort.  The  idea  is  to  use  any  available  result  as  early  as  possible. 

3.8.1  Quicksort 

Quicksort  is  a  recursive  function  which  partitions  its  input  list  in  two,  sorts  them  recur¬ 
sively,  and  appends  them  to  get  the  result.  We  try  to  parallelize  the  partition  and  the 
two  sub-sorts  so  that  once  a  number  less  than  the  pivot  is  found,  it  is  passed  to  the  sorter 
for  numbers  less  than  the  pivot.  Numbers  larger  than  the  pivot  are  handled  similarly. 

The  whole  sorting  process  is  a  number  of  partitions  that  internally  build  up  a  binary 
search  tree  (BST).  Our  program  views  it  as  a  binary  tree  of  concurrent  partitioning  cells. 
Each  cell  passes  the  input  from  its  parent  to  one  of  its  two  children  depending  on  whether 
it  is  less  or  larger  than  the  pivot.  When  the  input  stream  comes  to  an  end,  we  perform 
a  depth  first  search  with  in  order  traversal  of  this  BST  based  on  the  pivot  value  of  each 
partitioning  cell. 

The  partitioning  cells  in  this  scenario  are  exclusive  functions  in  Pscheme.  The  outgoing 
arcs  of  a  cell  are  actually  ports  visible  in  the  exclusive  function.  Cells  are  created  only 
when  needed.  When  a  cell  is  created,  an  input  arc  to  the  cell  is  also  built  as  a  multiport 
visible  in  its  parent  cell. 

Figure  3.13  shows  the  code  for  our  quicksort.  As  seen  in  the  function  qs,  the  sorting 
starts  with  one  partitioning  cell.  The  end-of-stream  token  is  actually  a  port  where  the 
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(define  (new-partition-cell  pivot  portl  port2) 

(exclusive 
(lambda  (v) 

(if  (number?  v) 

(if  (<  v  pivot) 

(if  (null?  portl) 

((new-partition-cell  v  nil  nil) 

(call/mp  (lambda  (p)  (set!  portl  p)  (die)))) 

(throw  portl  v)) 

(if  (null?  port2) 

((new-partition-cell  v  nil  nil) 

(call/mp  (lambda  (p)  (set!  port2  p)  (die)))) 

(throw  port2  v))) 

;else  v  is  a  port,  so  partition  is  done,  now  collect  results 
(throw  v  (lambda  ()  ;  delay  result  collection 

(pcall 

(lambda  (thl  th2) 

(append  (thl)  (cons  pivot  (th2)))) 

(call/sp  (lambda  (p) 

(if  (null?  portl) 

(lambda  ()  nil) 

(begin  (throw  portl  p)  (die))))) 
(call/sp  (lambda  (p) 

(if  (null?  port2) 

(lambda  ()  nil) 

(begin  (throw  port2  p)  (die))))))))) 

(die)))) 

(define  (qs  1) 

(if  (null?  1) 
nil 

(let  ((partition-cell  (new-partition-cell  (car  1)  nil  nil))) 

( (call/ sp 

(lambda  (result-port) 

(part it ion- cell 

(call/mp  (lambda  (p) 

(list->stream  (cdr  1)  p  result-port)))))))))) 


Figure  3.13:  Stream-based  quicksort 
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result  should  go.  When  a  cell  receives  a  non-number  element,  it  knows  it  is  a  result  port. 
Then,  it  prepares  to  pass  the  sorted  result  (a  list)  to  the  result  port  by  throwing  two 
newly  created  receiving  ports  to  its  children  respectively.  After  receiving  two  sorted  lists 
from  its  children,  the  cell  appends  then  and  passes  the  whole  sorted  list  to  its  parent  cell. 

We  implement  the  result-collecting  protocol  a  little  differently  from  the  above  idea. 
That  is,  the  result  is  collected  lazily.  A  thunk  is  always  passed  to  a  result  port,  and 
calling  it  gives  the  real  sorted  list.  The  reason  to  do  this  is  to  keep  exclusive  functions 
light,  and  avoid  parallelism  or  a  “context  switch”  (that  is,  a  die)  in  the  middle  of  exclusive 
functions’  bodies. 


3.8.2  Merge  Sort 

Another  application  of  pipelining-style  parallelism  is  the  merge  sort.  It  is  natural  to  think 
of  a  cell  merging  two  sorted  input  streams  and  generating  a  sorted  output  stream.  We 
use  a  recursive  function  ms  to  sort  a  vector  of  numbers  as  shown  in  figure  3.14. 

When  a  merging  cell  is  created  in  ms,  it  merges  two  streams  whose  lengths  differ  by 
at  most  one  from  each  other  so  that  pcall  works.  Since  the  pcall  expression  forces  the 
merging  cell  to  take  numbers  from  both  input  ports,  we  buffer  the  larger  one  in  one  of 
the  two  local  queues  for  subsequent  comparisons. 

Figure  3.15  shows  the  implementation  of  a  merging  cell.  In  a  merging  cell,  there  are 
two  local  queues  to  store  elements  from  the  two  input  streams  respectively.  At  least  one 
of  them  must  be  empty.  The  function  call  (compare-deq  value  queue  port  all)  keeps 
comparing  value  with  numbers  in  queue,  throwing  the  smaller  to  port  until  val  is  thrown 
or  queue  is  empty.  If  the  flag  all  is  true,  everything  will  be  thrown  to  port. 
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(define  (merge-sort  vet) 

( (stream->list  ’eos) 

(call/mp  (lambda(p)  (ms  vet  0  (-  (vector-length  vet)  1)  p))))) 

(define  (ms  num-vet  il  i2  out-port) 

(cond  ((eq?  il  i2) 

(throw  out-port  (vector-ref  num-vet  il)) 

(throw  out-port  ’eos) 

(die)) 

((eq?  i2  (+  1  il)) 

(let  ((nl  (vector-ref  num-vet  il)) 

(n2  (vector-ref  num-vet  i2))) 

(if  (<  nl  n2) 

(begin  (throw  out-port  nl)  (throw  out-port  n2)) 

(begin  (throw  out-port  n2)  (throw  out-port  nl))) 

(throw  out-port  ’eos) 

(die))) 

(else 

(let  ((13  (/  (+  il  i2)  2))) 

(pcall  (new-ms-cell  out-port) 

(call/mp  (lambda(p)  (ms  num-vet  il  i3  p))) 

(call/mp  (lambda(p)  (ms  num-vet  (+  i3  1)  i2p) ))))))) 

;  auxiliary  functions  used  by  new-ms-cell 

(define  (deq-all  q  p) 

(if  (q  ’empty?) 
nil 

(begin  (throw  p  (q  ’deq))  (deq-all  q  p)))) 

(define  (compare-deq  val  queue  port  all) 

(if  (queue  ’empty?) 

(if  all  (throw  port  val)  nil) 

(if  (<  val  (queue  ’head)) 

(begin  (throw  port  val) 

(if  all  (deq-all  queue  port)  nil)) 

(begin  (throw  port  (queue  ’deq)) 

(compare-deq  val  queue  port  all))))) 

Figure  3.14:  Stream-based  merge  sort 
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(define  (new-ms-cell  out-port  ql  q2  state) 

(exclusive 
(lambda(nl  n2) 

(cond  ((and  (eq?  nl  ’eos)  (eq?  n2  ’eos)) 

(if  (eq?  state  1) 

(deq-all  ql  out-port) 

(deq-all  q2  out-port)) 

(throw  out-port  ’eos)) 

((eq?  nl  ’eos) 

(if  (eq?  state  1) 

(compare-deq  n2  ql  out-port  #t) 

(begin  (deq-all  q2  out-port)  (throw  out-port  n2))) 

(throw  out-port  ’eos)) 

((eq?  n2  ’eos) 

(if  (eq?  state  1) 

(begin  (deq-all  ql  out-port)  (throw  out-port  nl)) 
(compare-deq  nl  q2  out-port  #t)) 

(throw  out-port  ’eos)) 

((null?  state)  ;  initial  state 

(if  (<  nl  n2) 

(begin  (throw  out-port  nl)  ((q2  ’enq)  n2)  (set!  state  2)) 

(begin  (throw  out-port  n2)  ((ql  ’enq)  nl)  (set!  state  1)))) 

((eq?  state  1) 

((ql  ’enq)  nl) 

(compare-deq  n2  ql  out-port  #f) 

(if  (ql  ’empty?) 

(begin  ( (q2  ’enq)  n2)  (set!  state  2)) 

(die))) 

(else  ;  (eq?  state  2) 

((q2  ’enq)  n2) 

(compare-deq  nl  q2  out-port  #f) 

(if  (q2  ’empty?) 

(begin  ((ql  ’enq)  nl)  (set!  state  1)) 

(die)))) 

(die)))) 

Figure  3.15:  Implementing  the  cell  merging  two  sorted  streams 
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3.9  Remarks 


What  was  the  objective  of  this  chapter?  We  have  shown  that  Pscheme  can  model  different 
kinds  of  parallel  programming  paradigms.  For  programmers  familiar  with  continuation- 
based  programs,  understanding  Pscheme  programs  is  not  harder  because  port  operations 
(call/sp,call/mp  and  throw)  are  similar  to  continuation  operations.  As  a  matter  of 
fact,  it  is  easy  to  program  in  Pscheme  with  the  help  of  our  library  of  higher  order  control 
facilities  and  parallel  data  structures. 

Our  point,  however,  is  not  to  develop  programs  that  can  beat  those  written  in  Mul¬ 
tilisp,  Lucid  or  Actors.  Instead,  we  try  to  reason  about  control  within  the  world  of 
applicative  programming,  and  illustrate  the  fact  that  ports  are  expressive  enough  to  de¬ 
scribe  most  concepts  in  parallel  computation.  Furthermore,  the  concept  of  ports  can  be 
used  to  explain  the  semantics  of  multi-threaded  applicative  languages.  2  We  believe  that 
ports  do  play  the  role  that  continuations  do  in  sequential  languages. 

We  would  also  like  to  point  out  that  Pscheme  does  support  the  declarative  program¬ 
ming  style.  Issues  that  are  easily  solved  by,  for  example,  functional  programs,  should  be 
solved  in  that  way.  But,  programming  in  Pscheme,  we  have  more  alternatives. 

As  for  practical  programming  in  Pscheme,  we  will  give  performance  numbers  for  the 
two  practical  stream-based  examples  quicksort  and  merge  sort  to  see  how  well  they  scale 
up  on  a  Von  Neumann  machine. 
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We  use  the  concept  of  ports  to  develop  Pscheme’s  operational  semantics.  This  can  be  seen  in  chapter 


4. 
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Chapter  4 


Operational 


Semantics 


In  this  chapter,  we  give  a  formal  definition  of  Pscheme.  The  framework  is  an  operational 
semantics  that  can  be  used  to  define  any  multi-threaded  expression  language  with  a  global 
naming  space.  Our  approach  is  similar  to  CML’s  operational  semantics  developed  by 
Reppy  [55]  and  Berry,  Milner  and  Turner  [8]. 

We  first  introduce  an  abstract  reduction  machine  that  illustrates  our  computation 
model.  Then,  we  present  Pscheme’s  operational  semantics  in  terms  of  the  reduction  rules 
on  our  abstract  machine.  As  CPS  is  used  to  define  sequential  languages,  the  reduction 
rules  are  presented  in  port  passing  style ,  which  also  constitutes  the  basis  of  Pscheme’s 
compiler. 

4.1  The  Abstract  Machine 

Our  abstract  machine  contains 

•  a  shared  memory  which  maps  addresses  to  values  in  the  language  domain, 

•  a  shared  queuing  mechanism  which  maps  queue  addresses  to  FIFO  queues,  and 


•  a  set  of  reduction  operations  applicable  to  threads  running  on  it. 


At  any  time,  the  configuration  of  the  abstract  machine  is  represented  by  the  triple: 

( set  of  running  threads,  shared  memory,  shared  queuing  mechanism ) 

The  semantics  of  a  Pscheme  program  is  defined  by  the  initial  and  terminating  configura¬ 
tion  of  the  abstract  machine  with  this  input  program. 

The  first  component  of  the  machine  configuration  is  defined  by  a  set  of  running  threads. 
Each  thread  is  expressed  as  a  triple  (expression,  environment,  port),  where  expression  is 
a  legal  expression  in  Pscheme,  environment  is  a  mapping  from  identifiers  in  a  Pscheme 
program  to  locations  (addresses)  in  the  shared  memory,  and  port  is  a  semantic  abstraction 
to  be  defined  later. 

The  instructions  of  the  abstract  machine  are  the  reduction  operations  between  ma¬ 
chine  configurations.  At  any  time  during  the  program  execution,  only  one  reduction  step 
can  occur,  but  the  one  to  apply  among  all  applicable  reductions  is  selected  nondetermin- 
istically. 

4.2  Meanings  of  Semantic  Notations 

A  thread  [exp,  env,  port]  has  the  meaning  that  a  process  is  currently  evaluating  the 
expression  exp  under  the  environment  env,  and  then  passes  the  result  to  the  port  port, 
port  is  syntactically  a  series  of  operation  tags  separated  by  the  symbol  A  port  can  be 
reified  1  at  the  expression  level,  (and  hence  becomes  an  object  in  the  language  domain  and 
can  be  manipulated  by  the  programmer)  and  stored  in  the  shared  memory.  The  top-level 
port  is  denoted  as  RETURN,  meaning  the  control  point  where  a  program  is  to  return  its 
final  result.  If  a  thread  is  denoted  as  [val,  env,  port]dorie,  it  means  that  it  is  about  to  pass 
JThe  term  reify  comes  from  the  concept  of  computational  reflection  [20,59,49]  It  means  to  make  an 
invisible  meta-level  object  (such  as  a  construct  used  in  the  language  interpreter  or  the  semantic  framework) 
available  or  tangible  at  the  language  level  so  that  the  programmer  can  manipulate  it  directly.  For  example, 
reifying  a  continuation  in  the  semantics  gives  a  one-argument  function  in  the  language. 
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P 
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G 
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QueueAddress  — ►  Queue 

q 

G 
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[*],  [ *\done 

G 

Process 

S 

G 

power  set(Process) 

s[*] 

= 

s  u  {[*]} 

update 

G 

(( Address  — ►  Representation )  X  Address  X  Representation ) 

— ►  ( Address  — ►  Representation ) 

update(M,  a,  v) 

= 

M  —  {(a,Md))f  {(a,u)} 

enqueue(Q ,  q,  v) 

= 

Q  -  {(?,  Qq)}  +  {(q,  enq(Qq ,  w))} 

dequeue(Q ,  qset) 

= 

Q  -  {(q,  Qq)\q  e  qset}  +  {(q,  deq(Qq))\q  G  qset} 

PRIM 

= 

the  set  of  all  identifiers  represting  primitive  functions 

Figure  4.1: 

Notation  definitions  in  Pscheme’s  operational  semantics 

the  evaluated  value  val  to  the  port  port,  and  there  is  no  need  to  evaluate  val  in  env. 

We  use  the  notion  Cl  =>■  C 2  to  denote  an  atomic  reduction  step  in  which  configura¬ 
tion  Cl  is  transformed  to  configuration  C 2.  As  mentioned  above,  the  system  configuration 
will  be  changed  through  only  one  reduction  rule  at  a  time.  If  there  are  more  than  one 
reduction  rules  applicable  to  some  system  configuration,  the  rule  to  apply  is  chosen  non- 
deterministically. 

Figure  4.1  shows  the  notations  (and  their  definitions)  we  will  use  in  our  reduction 
rules.  We  use  the  character  to  match  any  content  in  the  context.  To  simplify  the 
configuration  expression,  we  express  the  set  of  threads  without  the  curly  brackets 
Instead,  for  a  thread  set  S,  S  U  {[*]}  is  expressed  as  N[*].  Also,  representing  the  triple  for 
the  machine  configuration  is  abbreviated  without  the  parentheses  The  definitions  of 
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the  function  enq,  deq  and  power  set  in  figure  4.1  are  not  provided  for  the  obvious  meanings 
of  these  names. 

We  define  a  series  of  reductions  by  the  symbol  =y*  as  follows. 

C  =^*  C  iff  C  =  C  or  3 C"  C  =s>  C"  and  C"  =^*  C 

Definition  of  a  Pscheme  Program 

Let  eval(Exp)  be  the  set  of  all  possible  results  of  evaluating  the  program  expression  Exp , 
and  RETURN  be  the  top-level  output  port.  Then, 

1.  v  G  eval(Exp)  iff 

3 p,S,M,Q,  {[Exp,  {},  RETURN]},  {},  {}  =y*  S[v,p,  RETURN]^,  M,  Q 

2.  eval(Exp)  =  _L  iff 

-.3 v,p,S,M,Q,  {[Exp,  {},  RETURN]},  {},  {}  =y*  S[v,  p,REJ\JRH]done,  M,Q 

4.3  Operational  Semantics  of  Pscheme 

In  this  section,  we  present  the  reduction  rules  for  Pscheme  expressions  on  our  abstract  ma¬ 
chine.  This  is  a  parallel  extension  of  the  sequential  continuation-passing  style  semantics. 
The  continuation  abstraction  in  the  sequential  semantics  is  replaced  by  a  port  abstraction 
which  is  syntactically  composed  of  a  series  of  operation  tags.  We  call  it  the  port  pass¬ 
ing  style  semantics.  As  in  the  case  of  the  continuation  abstraction,  the  port  abstraction 
also  encapsulates  the  information  regarding  the  environment  and  the  computation  that 
follows.  Side  effects  caused  by  destructive  assignments  or  parallel  primitives  are  reflected 
in  the  shared  memory  or  the  shared  queuing  mechanism,  which  are  part  of  the  system 
configuration. 

As  stated  in  previous  chapters,  Pscheme  views  parallel  computation  as  a  dynamic 
data-flow  diagram.  Our  port-passing-style  semantics  described  here  makes  all  links  in 
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the  data-flow  diagram  explicit.  Evaluated  values  are  passed  explicitly  along  the  links  to 
receiving  nodes  just  as  a  thread  passes  a  value  to  its  port  abstraction  in  the  operational 
semantics. 

4.3.1  Basic  Reductions  Regarding  Sequential  Computa¬ 
tion 

Constants  do  not  require  evaluation  and  can  be  passed  to  ports  directly.  Evaluating  an 
identifier  requires  looking  it  up  in  the  environment  to  get  the  address  where  the  value 
of  the  identifier  is  stored.  Conditional  expressions  and  assignment  expressions  have  the 
same  semantics  as  in  CPS  semantics.  Primitive  operators,  if  not  bound  to  new  values, 
are  treated  as  first  class  closures  as  lambda  forms  are. 

Constants  and  Identifiers 

S[K,p,P]  ,  M  ,  Q  =>  S[K,p,P]done  ,M,Q 
S[I,p,P]  ,  M  ,  Q  =>  S[M(pI),p,P]done  ,M,Q 


Conditionals 

5[(if  ei  e2  e3),p,  P]  ,  M  ,  Q  =y  S[e i,  p,  <  branch,  e2,  e3  >::  P]  ,  M  ,  Q 
S[v,  p,  <  branch,  ei,e2  >::  P] 

done  •)  M  ,  Q 

S[ei,  p,  P]  ,  M  ,  Q  if  v  =  nil 
S[e2,p,P]  ,  M  ,  Q  otherwise 

Shared  Memory 

5[(set!  X  Y),p,  P]  ,  M  ,  Q  =y  5[y,  p,  <  assign,  X  >::  P]  ,  M  ,  Q 

5[w,  p,  <  assign,  X  >::  P]<w  ,  M  ,  Q  =>  5[w,  p,  P]  ,  update(M ,  pX,  v)  ,  Q 
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Function  Abstraction 


5[(lainbda  args  body),  p,  P]  ,  M  ,  Q  =>■  S[closure(args,  body,  p),p,  P]done  ,  M  ,  Q 
S[F,  p,  P],  M,  Q  =>■  5'[primop(_F),  P]done ,  M,  Q  if  F  G  PRIM,  and  pF  undefined 


4.3.2  Exclusive  Functions 

A  function  can  be  made  exclusive,  that  is,  at  most  one  activation  of  an  exclusive  function 
can  exist  at  any  time.  When  an  exclusive  function  is  called,  a  shared  lock  and  a  shared 
queue  is  created  to  guarantee  the  exclusiveness  and  to  store  the  calling  requests.  At  the 
same  time,  another  thread  is  created  to  retrieve  arguments  (if  any)  in  the  queue  and  dis¬ 
patch  them  to  the  function.  The  following  rules  show  that  the  primitive  exclusive  can 
only  take  as  its  argument  normal  functions  or  primitive  operators.  No  exclusive  functions 
can  be  passed  to  exclusive. 

N[( exclusive  exp),  p,  P],  M,  Q  =y  S[exp,  p,  <  makeexclusive  >::  P],  M,  Q 

S[dosure{args,body,p),p',<  makeexclusive  >::  P]  done  i  M,  Q  => 

' 

S[exclosure(args,  body,  p,  q,  a),p',  P]d0ne[e%closure(args,  body,  p,  q,  a),p',  <  exgetval,  q  >::  P'jdone, 

<  M  U  (a,  free),  Q  U  (q,  empty)  if  P  =<  Apply  With,  *  >::  P'  or  <  Para  Apply  With,  *  >::  P' 

S[exclosure(args,  body,  p,  q,  a),p',  P]done ,  M  U  (a,  free),  Q  U  (q,  empty)  otherwise 


S[primop(F),  p,  <  makeexclusive  >::  P]  done  i  M,  Q  => 

' 

S[exprimop(F,  q,  a),p,  P]done[exprimop(F,  q,  a),p,  <  exgetval,  q  >::  P']done , 

<  M  U  (a,  free),  Q  U  (q,  empty)  if  P  =<  Apply  With,  *  >::  P1  or  <  Para  Apply  With,  *  >::  P1 

S[exprimop(F,  q,  a),p,  P]done ,  M  U  (a,  free),  Q  U  (q,  empty)  otherwise 
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4.3.3  Function  Application 


Function  application  expressions  are  handled  by  evaluating  the  function  first,  and  then 
evaluating  all  argument  expressions  sequentially  or  in  parallel  depending  on  whether  pcall 
is  used  or  not. 


The  Sequential  Case 

In  a  sequential  function  application  expression,  arguments  are  evaluated  left  to  right. 
When  all  arguments  have  been  evaluated,  the  function  is  called  if  it  is  not  exclusive.  Oth¬ 
erwise,  arguments  are  enqueued  for  exclusive  calls. 

F  {pcall, call/sp, call/mp, throw, exclusive, set!} 

S[(FA1  ■  ■  ■  An),p,  P]  ,  iff,  Q  S[F,p,<  ApplyWith,  •  •  •,  >::  P]  ,  iff,  Q 

S[f,p,  <  ApplyWith,  Ai,  •  •  •,  An  >::  P]done  ,  M  ,  Q 
=y  5[Ai,p,  <  EvalArg,  1,/,  A2,  ■  ■  ■ ,  An  >::  P]  ,  M  ,  Q 

S[vi,p ,  <  EvalArg,*,/,  ui,  •  •  • ,  *y_i,  Ai+1,  ■  ■  ■ ,  An  >::  P]done  ,  M  ,  Q 

=y  S[Ai+1,p,  <  EvalArg,*  +  l,f,v1,  ■  ■  -  ,*y,  At+2,  ■  ■  ■  ,An  >::  P],M,Q,  !<*<**-! 


S[vn,p,  <  EvalArg,  **,/,**!,  •  •  -  ,vn_ t  >::  P] 

done  i  iff,  Q  =y 

' 

S  ,  iff  ,  enqueue(Q ,  q,list(vi  ■  ■  ■  vn))  If  /  =  exprimop(F,  a,  q)  or  exclosure(args,  body,  p,q) 

< 

S[f,p,  <  CallWith,**!,  P]done,  iff,  Q  otherwise 
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The  Parallel  Case 


In  a  pcall  expression,  the  function  expression  is  evaluated  first.  At  the  same  time,  a 
queue  is  built  as  the  implicit  port  for  each  argument  position  so  that  the  evaluated  value 
of  can  be  passed  to.  Then,  all  the  argument  expressions  are  evaluated  in  parallel. 

5[(pcall  F  Ai  ■  ■  ■  An),  p,  P],  iff,  Q  =>■  S[F,  p,  <  Para  Apply  With,  Ai,  ■  ■  ■ ,  An  >::  P],  iff,  Q 

S[f ,  p,  <  ParaApplyWith,  Ai,  •  •  • ,  An  >::  P]<w,  M,  Q 

=>  S[Ax,p,  <  enter,  q1  >]■■■  [ An,  p ,  <  enter,  qn  >][f,p,  <  getval,gi,  P]done , 

iff,  Q  U  {(qi,  empty) 1 1  <  i  <  n} 

5[n,  p,  <  enter,  q  >]done  ,  M  ,  Q  =>•  S  ,  M  ,  enqueue(Q,q ,  u) 


Function  Call 

In  a  function  application  expression,  when  the  function  and  all  the  argument  expressions 
have  been  evaluated,  the  function  call  occurs.  Calling  a  non-exclusive  function  has  the 
same  semantics  as  in  Scheme  except  that  arguments  evaluated  in  parallel  have  to  syn¬ 
chronize  and  then  be  retrieved  from  the  queue  by  a  dispatching  process. 

S\primop(F),p ,  <  CallWith,ui,  P]done  ,  M  ,  Q 

=y  S[OP(F,  list(v  1  •  •  •  vn)),p,  P]done  ,  M  ,  Q 
2 

2 The  formal  definition  of  OP  is  deliberately  eliminated.  In  general,  the  function  OP  applies  a  primitive 
function  to  a  list  of  arguments  and  returns  a  result  value. 
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S[closure(list(xi  ■  ■  -xn),  body ,  p),p',  <  Cal  I  With,  tq,  •  •  • ,  vn  >::  P]done  ,  M  ,  Q 
=>  5[6orfy,/>[a;i/ai,  •  •  -  ,xn/an],  <  geten v,p'  >::  P],M  U  {(a8,u8) |1  <  i  <  n}  ,  <Q 
where  a\  ■  ■  ■  an  are  new  addresses 

If  Vi  =  hd(Qqi)  7^  empty,  1  <  i  <  n 

S[primop{F),  p,<  getval,  ••  -,qn  >::  P]done  ,  M,  Q 

=>-  S[primop(F),  p,  <  getval,  q\,  -  ■  ■  ,qn  >::  P]<w 

[(9P(P,  Pst(u1---u,J),p,P](W,M,  degMene(<Q,{g8|l  <  i  <  n}) 

If  tq  =  hd(Qqi)  7^  empty,  1  <  i  <  n 

S[closure(list(x\  ■  ■  ■  xn),body,  p),  p1,  <  getval,  q1,  ■  ■  • ,  qn  >::  P]done  ,  M  ,  Q 
=>•  5[c/osMre(/isf(a;i  •  •  •  xn),body,  p),  p1,  <  getval,  q1,  ■  ■  -,qn  >::  P]done 
[body,  p[x\j  a>i,  ■  ■■,xn/an\,  <  geten  v,p'  >::  P], 

M  U  {(ai,  ty)|l  <  i  <  n},  dequeue(Q ,  {g8|l  <  i  <  n}), 
where  ai  •  •  •  an  are  new  addresses 


4.3.4  Blocking 

If  the  lock  of  an  exclusive  function  is  not  free,  the  invocation  of  it  will  be  blocked  until 
the  shared  lock  is  released.  In  the  parallel  case,  when  the  lock  is  free,  evaluated  argu¬ 
ments  that  are  dequeued  and  supplied  to  the  pcalled  exclusive  function  will  be  dispatched 
directly  without  going  through  the  queue  associated  with  the  exclusive  function.  This  is 
to  preserve  the  order  of  values  that  are  sent  to  these  ports. 
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If  Ma  =  free  and  hd(Qq)  =  list(v \  ■  ■  -vn) 

S[exprimop(F,  q,a),p',  <  exgetval ,  g  >::  P]done  ,  M  ,  Q 
=>•  S[exprimop(F,  q,a),p',  <  exgetval,  q  >::  P]done 

[OP(F,list(v i,  •  •  -,vn)),p,  <  release,  a  >::  P]done, 
update(M,a,\ocked),dequeue(Q,{qi\l  <i<  n}) 

If  Ma  =  free  and  hd(Qq)  =  list(v \  ■  ■  -vn) 

S[exclosure(list(xi  ■  ■  ■  xn),  body ,  p,  q,  a),  p1,  <  exgetval,  q  >::  P]done  ,  M  ,  Q 
=>■  5[ea;c/osMre(/isf(a;1  •  •  •  sn),  body ,  p,  q,  a),  //,  <  exgetval,  q  >::  P]done 
[6odp,  p[a:i/ai,  •  •  • ,  sn/an],  <  release,  a  >::<  getenv,  p'>::P], 
update(M,  a,  locked)  U  {(a;,  ty)|l  <i<  n} ,  dequeue(Q ,  {g8|l  <  i  <  n}) 
where  ai  •  •  •  an  are  new  addresses 

5[n,p,  <  release,  a  >::  P]<w  ,  M  ,  Q  =>  S[v,p,P]done  ,  update(M,a,  free)  ,  Q 

If  Ma  =  free  and  ty  =  hd(Qqi)  7^  empty,  1  <  i  <  n 
S[exprimop(F ,  g,  a),p,  <  getval,  gx,  •  •  • ,  qn  >::  P]<w,  -M",  Q 
=>•  S[exprimop(F,  q,  a),p,  <  getval,  gi,  •  •  • ,  gn  >::  P]<w 
[CP(P,  •  •  -  ,vn)),p,  <  release,  a  >::  P]<w, 

update(M,  a,  locked),  dequeue(Q,  {g,|l  <i<  n}) 

If  Ma  =  free  and  ty  =  hd(Qqi ),  1  <  i  <  n 

5[ea;c/osMre(/isf(a;i  •  •  •  xn),body,  p,  q,a),  p',  <  getval,  gx,  •  •  • ,  qn  >::  P]done,M,Q 
=>  S[exclosure(list(x\  ■  ■  -xn),body,  p,q,a),  p' ,  <  getval,  gx,  •  •  ■  ,qn  >■■  P]done 
[body,  p[x\/a\,  •  •  • ,  xn/an\,  <  release,  a  >::<  getenv,  p'  >::  P], 
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update(M,  a ,  locked)  U  {(a;,  Vi)\l  <  i  <  n},  dequeue(Q ,  {g8|l  <  i  <  n}) 
where  ai  •  •  •  an  are  new  addresses 


4.3.5  Environment  Recovery 

When  a  function  is  called,  its  lexical  environment  stored  in  the  closure  is  referred  for 
evaluation.  When  the  call  returns,  the  old  environment  is  retrieved. 


5[w,  p,  <  getenv,  p'  >::  P]<w  ,  M  ,  Q  =y  5[w,  p\  P]done  ,  M  ,  Q 


4.3.6  First  Class  Ports 

Ports  can  be  reified  as  language  constructs.  The  shared  information  about  the  reified 
port,  including  the  control  point,  current  environment  and  its  state,  is  stored  in  a  shared 
memory  location,  and  can  be  referenced  by  any  thread.  A  singleport  is  represented  by 
the  control  point,  its  current  environment,  and  its  active/closed  state.  In  the  case  of 
multiports,  if  the  current  port  is  associated  with  an  argument  position  of  a  pcall  expres¬ 
sion  (the  second  rule  below),  or  with  the  last  argument  position  of  an  exclusive  function 
call  expression  (the  third  and  fourth  rule  below),  the  queuing  information  in  the  current 
port  has  to  be  stored  and  shared  as  well.  Otherwise,  only  the  control  point  and  current 
environment  information  are  stored,  and  the  multiport  degenerates  to  be  a  procedural 
abstraction  (the  fifth  rule  below).  In  general,  a  port  provides  ordering  information  only 
when  ordering  is  important  at  the  control  point  the  port  represents. 

5[(call/sp  F),p,P]  ,  M  ,  Q 

=}►  S[F,p,<  CallWith  ,  a  >::<  refport,  a  >],  M  U  {(a,  <  singleport,  p  Q 
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5[(call/mp  F),p,  <  enter,  q  >]  ,  M  ,  Q 


=>  S[F,  p,  <  CallWith  ,  a  >::<  refport,  a  >],  M  U  {(a,  <  multiport,  p,  q  >)},Q 

^[(call/mp  F),p,  <  EvalArg,  n,  exclosure(args ,  body ,  //,  g,  a),  tq,  •  •  • ,  >::  P]  ,  M  ,  Q 

=}►  5[P,p,<  CallWith  ,  a  >::<  refport,  a  >]M  U  {(a,  <  multiport,  p,  q  >)},Q 

^[(call/mp  F),p,  <  EvalArg,  n,  exprimop(F ,  g,  a),  tq,  •  •  • ,  tv_i  >::  P]  ,  M  ,  Q 
=}►  5[P,p,<  CallWith  ,  a  >::<  refport,  a  >]M  U  {(a,  <  multiport,  p,  q  >)},Q 

If  P  enter,  q  >,  or 

<  EvalArg,  n,  ex  closure^- t=),  tq,  •  •  •  ,vn  >::  P'  or 

<  EvalArg,  n,  exprimop(- t=),  tq,  •  •  -  ,vn  >::  P' 

5[(call/mp  P),  p,  P]  ,  M  ,  Q 

=>■  5[P,  p,  <  CallWith,  a  >::  P],  M  U  {(a,  <  getenv,  p  >::  P)},  Q 


4.3.7  Parallel  Control 

A  (die)  expression  terminates  the  execution  of  the  current  thread  without  returning  any¬ 
thing.  If  during  the  invocation  of  an  exclusive  function,  a  (die)  is  evaluated,  the  lock  is 
released  so  that  pending  requests  to  call  the  exclusive  function  can  be  satisfied.  A  (die) 
releases  all  locks  in  the  current  thread  if  it  happens  in  nested  exclusive  function  calls. 

5[(die),p,  RETURN]  ,  M  ,  Q  =y  S  ,  M  ,  Q 
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5[(die),  p,  <  enter,  q  >]  ,  M  ,  Q  =y  S  ,  M  ,  Q 

5[(die),  p,  tag  ::  P]  ,  M  ,  Q 

j"  5[(die),  p,  P]  ,  update(M,a,  free)  ,  Q  if  tag  =<  release,  a,  g  > 
I  5[(die),  p,  P]  ,  M  ,  Q  otherwise 


A  throw  expression  causes  the  left-to-right  evaluation  of  its  arguments,  and  passes 
the  result  of  the  second  argument  to  the  result  of  the  first  argument  (a  location  storing  a 
port).  The  semantic  rules  of  throw  guarantee  that  two  throws  by  the  same  thread  to  a 
multiport  with  a  queuing  mechanism  will  cause  the  values  to  be  enqueued  in  the  throwing 
order. 

5[(throw  A\  A2),p,  P],M,Q  =t-  S[A\,p,  <  throw,  A2  >::  P],M,Q 


S[v,p,  <  throw,  A2  >::  P]done,M,Q 


S[A2,p,  <  PassTo,  v  >::  P],  iff,  Q 


5[r,  p,  <  PassTo,  a  >::  P]done,M,  Q  =y 
' 

5[r,  p,  P]done[v,  p',  P%one,update(M ,  a,  closed),  Q 

S[v,p,P]done,M,Q 

5[r,  p,  P]done,  M,  enqueue(Q,q ,  v) 

I  S[v,p,P]done[v,p',P'] 

done  i  M,Q 


if  Ma  =<  singleport,  p'  >::  P' 
if  Ma  =  closed 
if  Ma  =<  multiport,  p',  q  > 
if  Ma  =<  getenv,  p'  >::  P' 


^[r,  p,  <  singleport,  p'  >::  P']done ,  M,  Q  =>■  ^[r,  p',  P'],  update(M,  a,  closed),  Q 

5[r,  p,  closed  . - \l.  ()  :  S.\I.() 
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S[v,p,  <  multiport,  p\  q  >]done,M,Q  =y  S,M,enqueue(Q,q,v) 


p.  <  refport,  a  >]done,  M,  Q  =y  p,  Ma]done,  \l.  Q 


4.4  An  Example 

We  illustrate  in  this  section,  with  the  short  example  (pcall  +  (call/mp  (lambda(p) 
1))  2),  that  our  operational  semantics  correctly  defines  the  meaning  of  a  Pscheme  ex¬ 
pression.  The  following  series  of  reduction  shows  that  the  expression  will  return  3  as  its 
final  answer.  In  this  example,  the  series  of  rules  to  apply  is  chosen  nondeterministicahy. 
However,  it  is  not  difficult  to  observe  that  the  order  of  reduction  rules  chosen  will  not 
affect  the  terminating  configuration,  in  which  the  value  3  will  be  returned  as  the  final 
result. 


[(pcall  +  (call/mp  (lambda(p)  1))  2) ,{},RETU RN],  {},  {} 

=>■  [+,{},<  Para  Apply  With ,  (call/mp  (lambda(p)  1) )  ,2>::RETURN],  {},  {} 

=y  [primop(-\-),  {},  <  ParaApplyWith,  (call/mp  (lambda(p)  1) ) ,  2  >::  RETU  RNJ^one, 
{},{} 

=y  \primop(  +  ),  {},  <  getval,  qi,  g2  >::  RETURN]done 

[(call/mp  (lambda(p)  1) ),{},<  enter,  qi  >] [2,  {},  <  enter,  q2  >], 

{},  {(gi,  empty),  (q2,  empty)} 

=y  \primop(  +  ),  {},  <  getval,  qi,  q2  >::  RETURN]done 

[(call/mp  (lambda(p)  1) ),{},<  enter,  qi  >] [2,  {},  <  enter,  q2  >], 
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{},  {(gi,  empty),  (g2,  empty)} 

=>-  \primop(  +  ),  {},  <  getval,  gi,  q2  >::  RETURN]dorie 

[(lambda(P)  1) ,  {},  <  CallWith,  a  >::<  refport,  a  >] [2,  {},  <  enter,  q2  >], 

{(a,  <  multiport,  {},  qt  >},  {(gi,  empty),  (g2,  empty)} 

=>•  \primop(  +  ),  {},  <  getval,  gi,  q2  >::  RETURN]done 

[closure(list(p),  1,  {}),  {},  <  CallWith,  a  >::<  refport,  a  >]done[ 2,  {},  <  enter,  q2  >], 
{(a,  <  multiport,  {},  qt  >},  {(gi,  empty),  (g2,  empty)} 

=>•  [pr imop( +  ),{},  <  getval,  qt,q2  >::  RETURN]done 
[1,  {(p,  a)},  <  refport,  a  >] [2,  {},  <  enter,  g2  >], 

{(a,  <  multiport,  {},  gx  >},  {(gi,  empty),  (g2,  empty)} 

=>-  [pHmop(+),{},<  getval,  gi,g2  >::  RETURN]done 
[1,  {(p,  a)},  <  refport,  a  >]done[2,  {},  <  enter,  g2  >], 

{(a,  <  multiport,  {},  gx  >},  {(gi,  empty),  (g2,  empty)} 

=>•  [pHraop(+),{},<  getval,  gi,g2  >::  RETURN]done 
[1,  {},  <  multiport,  {},  q1  >]done[2,  {},  <  enter,  g2  >], 

{(a,  <  multiport,  {},  gx  >},  {(gi,  empty),  (g2,  empty)} 

=>•  [pHmop(  +  ),{},<  getval,  gi,g2  >::  RETU  RN]done[2,  {},<  enter,  g2  >], 

{(a,  <  multiport,  {},  gx  >},  {(q1, 1::  empty),  (g2,  empty)} 

=>•  [pHmop(  +  ),{},<  getval,  gi,g2  >::  RETU  RN]done[2,  {},<  enter,  g2  >]done , 

{(a,  <  multiport,  {},  gx  >},  {(gl5 1::  empty),  (g2,  empty)} 

=>-  [primop(  +  ),{},  <  getval,  q1,q2  >::  RETURN]done 

{(a,  <  multiport,  {},  q1  >},  {(gi ,  1::  empty),  (g2,  2::  empty)} 

=>-  [pHmop(+),{},<  getval,  gi,g2  >::  RETURN]done 
[3,  {},  RETURN] 

done  i  {(a,  <  multiport,  {},  q±  >},  {(gi,  empty),  (g2,  empty)} 
(Termination  condition) 
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Chapter  5 


The  Compiler 


Our  implementation  of  Pscheme  consists  of  two  parts:  the  compiler  and  the  run  time 
system.  The  compiler  generates  portable  C  code  so  that  Pscheme  can  run  on  different 
parallel  platforms.  The  run  time  system  takes  care  of  resource  management,  ordering, 
locking  and  process  control.  We  describe  our  compilation  process  in  this  chapter  and  the 
run  time  system  in  the  next. 

The  Pscheme  compiler  is  based  on  the  approach  of  the  SML/NJ  compiler  [4,5].  Be¬ 
cause  our  port  mechanism  is  an  extension  of  continuations,  we  only  need  to  modify  the 
continuation-passing-style  program  representation  so  as  to  be  able  to  characterize  multi¬ 
ple  threads  and  ports.  By  continuation  passing  style,  we  mean  all  function  calls  are  tail 
calls,  and  all  function  bodies  end  with  function  call  expressions.  In  general,  a  compiled 
Pscheme  program  will  have  the  following  features. 

•  C  code,  instead  of  assembly,  is  generated  to  increase  portability  among  different 
parallel  platforms. 

•  All  objects  are  heap  allocated.  The  compiler  does  no  escape  analysis  to  decide 
whether  a  closure  (or  a  continuation)  should  be  stack-allocated  or  heap-allocated. 


•  Every  thread  is  in  the  continuation  passing  style,  and  is  abstracted  as  a  single 
function  application  expression. 

•  Ports  are  parallel  data  structures  containing  parallel  control  information  such  syn¬ 
chronization,  thread  creation  and  thread  termination. 

Our  operational  semantics  associates  each  subexpression  in  a  Pscheme  program  with 
an  implicit  port,  and  interprets  the  program  in  the  port  passing  style.  Our  compiler  follows 
this  viewpoint  except  that  ports  will  only  be  constructed  when  they  are  made  explicit  (via 
call/mp,  call/sp  or  pcall).  Since  the  ordering  is  important  only  when  there  is  an  order 
among  elements  passed  to  an  explicit  port,  the  continuation  passing  thread  representation 
suffices  if  ports  can  be  encapsulated  in  continuations,  and  the  order  of  elements  passed 
to  a  port  can  be  guaranteed.  In  other  words,  passing  a  value  to  a  port  is  the  same  as 
invoking  a  continuation  which  does  nothing  but  the  pass  operation.  1 

The  Pscheme  compiler  is  composed  of  a  series  of  program  transformation  processes, 
which  transforms  the  input  program  into  an  intermediate  representation  for  the  code  gen¬ 
erator  to  use.  These  stages  are: 

1.  Syntactic  preprocessing 

2.  Scope  analysis 

3.  Assignment  conversion 

4.  CPS  transform 

5.  Closure  conversion 

6.  Thread  abstraction 

7.  Code  generation 


i 


The  pass  operation  is  implemented  as  a  parallel  primitive  operation  in  the  run  time  system. 
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Since  the  Pscheme  compiler  generates  C  code,  we  do  not  perform  register  allocation. 
Instead,  we  assume  a  sufficient  number  of  virtual  registers  allocated  by  the  run  time  sys¬ 
tem,  and  those  virtual  registers  can  be  cached  (for  example,  declared  as  register  variables) 
in  C. 


5.1  Intermediate  Representation 


Our  compiler  assumes  the  input  program  to  be  a  series  of  define  expressions  followed  by 
a  series  of  regular  Pscheme  expressions.  The  answer  to  the  program  will  be  the  return 
value  of  the  last  expression  under  the  bindings  of  all  those  define  expressions  and  the 
side  effects  of  previous  expressions.  With  a  legal  input,  these  transformations  generate 
an  intermediate  representation,  which  is  a  big  letrec  expression  in  CPS  form.  2  A  CPS 
expression  cexp  in  our  compiler  is  a  tagged  tuple  defined  as  follows: 


2 The  port-passing  style  representation  used  in  the  operational  semantics  is  not  adopted  here  for  there 
is  no  need  to  create  a  port  for  each  subexpression.  As  mentioned  above,  a  port  is  only  created  when  the 
ordering  matters.  Instead,  continuation-passing  style  suffices  as  long  as  we  are  able  to  encapsulate  relevant 
port  information  in  continuations. 
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cexp  =  <  letrec , binding  list,cea;p> 

or  <  primop,  sexp  list ,  id, cexp  > 


or 

< 

content, id,  cexp  > 

or 

< 

assign,  id,  sexp,  id,  cexp  > 

or 

< 

if,  id,  cexp,  cexp  > 

or 

< 

para,  apply-expression  list  > 

or 

< 

pass,  sexp, id  > 

or 

< 

enq  —  port,  sexp,  id  > 

or 

< 

deq  —  port,  id  > 

or 

< 

apply,  id,  sexp  list  > 

or 

< 

end, sexp  > 

or 

< 

die  > 

where 

•  id  denotes  a  legal  identifier  in  Pscheme, 

•  sexp  denotes  a  “simple  expression” ,  which  is  either  an  identifier  or  a  language  level 
constant  (which  could  be  a  number,  a  character,  a  boolean,  a  string  or  a  quoted 
symbol),  and 

•  a  binding  is  a  (id,  bexp)  pair  where  bexp  is  a  data  structure  expression  defined  as 
follows. 
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bexp 


=  <  lambda,  id  list ,cexp> 

or  <  singleport,  id  > 
or  <  multiport,  id  > 
or  <  exclusive,  id  > 

or  <  build  —  port,  id,  int  > 
or  <  sync,  id,  int,  id  > 
or  <  bind,  id,  sexp  > 
or  <  vector,  constant  list  > 
or  <  cell,  id  > 
or  <  closure,  id  list  > 
or  <  select,  id,  int  > 
or  <  offset,  id,  int  > 

or  a  cexp  containing  only  primop,  content,  assign  or  if  expressions 


A  binding  expression  bexp  represents  the  construction  of  a  compound  data  object  to 
be  bound  to  an  identifier.  It  only  appears  in  the  letrec  binding. 

In  the  intermediate  representation,  the  only  thread-creating  expression  is  the  para 
expression,  which  specifies  a  list  of  thread  abstractions  to  create,  each  of  which  is  actually 
an  apply  expression.  Passing  values  to  ports  are  done  by  the  pass  expression,  which  can 
either  terminate  the  current  thread  or  activate  a  new  continuation  that  the  port  object 
encapsulates  depending  on  the  locking  information  in  the  port  object.  The  end  expression 
returns  the  answer  to  the  top  level.  The  die  expression  terminates  the  current  thread. 

All  other  expressions  are  similar  to  their  sequential  CPS  counterparts.  The  apply 
expression  is  a  no-return  function  call  (a  tail  call).  The  primitive  function  call  (the 
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primop  expression  shown  above)  is  also  in  CPS  form  except  that  its  continuation  is  not 
explicitly  constructed  to  take  the  result  as  the  argument.  Instead,  we  bind  a  new  variable 
(the  component  id)  with  the  result  and  perform  the  subsequent  computation  (the  last 
component  cexp)  directly.  The  same  case  applies  to  the  content  and  assign  expressions. 

The  definition  of  the  binding  expression  bexp  represents  objects  of  all  data  types  aside 
from  those  of  expressible  constants  (numbers,  characters,  booleans,  strings  and  quoted 
symbols).  It  is  worth  noticing  that  we  have  distinguished  lambda  objects  (or  lambdas)  and 
closure  objects  (or  closures)  so  that  lambdas  contain  primarily  the  code  information  of  a 
lambda  expression,  and  closures  are  records  whose  fields  can  be  lambdas  or  free  variable 
values.  (Our  transformation  always  builds  closure  records  with  several  consecutive  lambda 
fields  followed  by  several  consecutive  free  variable  fields.) 

Closures  may  possibly  have  common  free  variables.  In  order  to  let  closures  share  fields, 
a  closure  is  characterized  by  a  pointer  to  a  record  and  a  non-negative  offset  that  locates 
the  closure’s  base  address  (where  in  the  record  the  closure  starts)  by  the  distance  to  that 
record’s  first  field,  select  selects  a  field  of  a  closure,  offset  returns  another  closure  sharing 
some  fields  with  the  original  closure  record  but  with  a  different  base  address  (that  is,  a 
different  offset ).  exclusive  takes  a  lambda  and  returns  another  identical  one  except  that 
calling  it  is  exclusive. 

bind  binds  the  label  with  a  simple  constant  or  another  identifier.  A  cell  <  cell,  id  > 
is  a  pointer  object  pointing  to  some  location  stored  with  the  object  that  id  denotes. 
Port-related  expressions  will  be  explained  later  in  this  chapter. 

A  binding  expression  can  also  be  a  CPS  expression  as  long  as  function  applications  or 
parallel  primitives  do  not  occur  in  it.  The  principle  is  to  allow  only  primitive  operations 
and  storage  operations  in  a  binding  expression  so  as  not  to  violate  the  CPS  form. 
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Notice  that  our  binding  expression  contains  no  construct  for  building  lists  or  cons-cells. 
Cons-cells  are  only  constructed  through  the  primitive  operator  CONS.  Even  for  a  constant 
list,  the  compiler  will  create  a  new  identifier  (and  allocate  space)  for  each  cons-cell  needed 
to  construct  the  list. 

5.2  Syntactic  Preprocessing,  Scope  analysis  and 
Assignment  Conversion 

5.2.1  Syntactic  Preprocessing 

The  first  stage  of  the  Pscheme  compiler  translates  syntactic  forms  into  Pscheme  domain 
expressions.  There  are  four  kinds  of  syntactic  forms  related  to  conditional  expressions: 
case,  cond,  and  and  or.  They  are  all  transformed  into  if  expressions.  3 

The  let  form  is  translated  into  a  lambda  application  expression  as  follows. 

(let  ((xl  expi) . . (xn  expn))  body) 

=y  ((lambda  (xl  ..  xn)  body)  expi  ..  expn) 

The  letrec  expression,  according  to  Scheme’s  definition,  should  be  translated  as  fol¬ 
lows. 

(letrec  ((xl  exp\) .  . (xn  expn))  body) 

=y  (let  ((xl  nil) . . (xn  nil))  (set!  xl  exp\) . . (set !  xn  expn)  body) 

The  letrec  form  is  usually  used  to  express  mutually  recursive  functions,  but  Scheme’s 
letrec  form  means  more.  Transforming  a  letrec  expression  according  to  the  definition 
may  cause  extra  inefficiency.  We  would  like  to  keep  a  structure  characterizing  mutual 
recursion  at  the  intermediate  representation  level.  Such  a  selective  transformation  will  be 

3 The  transform  algorithm  is  trivial  by  definition,  and  is  thus  omitted. 
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discussed  in  more  details  in  section  5.3.3. 


5.2.2  Scope  Analysis 

The  next  stage  is  scope  analysis ,  which  renames  identifiers  with  the  same  names  but 
appearing  in  different  scopes  so  that  every  identifier  name  is  unique.  The  reason  to 
perform  such  transformation  is  that  it  makes  it  easier  to  restructure  the  program  by 
removing  nested  binding  blocks.  (We  call  it  hoisting .)  This  is  done  simply  by  recording 
the  scope  pointer  in  the  entry  of  each  identifier  in  the  symbol  table  so  that  each  occurrence 
of  the  variable  to  rename  can  be  found. 

5.2.3  Assignment  Conversion 

In  Scheme  (and  all  imperative  languages),  identifiers  denote  values  sometimes  and  lo¬ 
cations  in  other  time.  In  most  imperative  languages,  the  compiler  has  to  distinguish 
left-hand-side  identifiers  and  right-hand-side  identifiers  to  generate  appropriate  code.  In 
an  expression  language  such  as  Scheme,  where  every  expression  returns  a  value,  this  vi¬ 
olates  the  orthogonality  of  an  expression’s  meaning.  One  solution  to  this  problem  is  to 
let  the  compiler  explicitly  distinguish  locations  and  values  in  its  intermediate  representa¬ 
tion.  Some  Scheme  compilers  (such  as  ORBIT  [47])  transform  all  the  set!  expressions  so 
that  the  meanings  of  identifiers  are  not  subject  to  change,  and  the  compiler  can  translate 
program  expressions  regardless  where  they  appear.  The  Pscheme  compiler  also  performs 
such  a  transformation.  The  principle  is  shown  as  follows. 

Case  1  :  Binding  through  lambda 
(lambda  (x)  ..  x  ..  (set!  x  v)  ..  ) 
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(lambda  (x) 


(let  ((x’  (cell  x))) 

. .  (content  x’ ) 

(assign  x’  v) 

...  )) 

(lambda  (x) 

((lambda  (x’) 

. .  (content  x’ ) 

(assign  x’  v) 

...  ) 

(cell  x))) 

Case  2  :  Binding  through  letrec 

(letrec  (..  (x  init-exp)  ..) 

.  .  x  .  . 

(set!  x  v) 

..  ) 


(letrec  (..  (x  init-exp)  (x’  (cell  x)) 
. .  (content  x’ ) 


(assign  x’  v) 

..  ) 


71 


That  is,  if  an  identifier  x  will  be  assigned  a  new  value  by  set!,  we  create  a  new 
identifier  x’  denoting  the  location  storing  x.  For  all  references  to  the  value  of  x,  we  use 
(content  x’)  in  place,  and  for  all  set!  expressions  referring  the  location  of  x  such  as 
(set!  x  exp),  we  use  (assign  x’  exp)  instead. 

After  assignment  conversion,  we  are  able  to  perform  CPS  transform,  the  kernel  part 
of  the  compiler. 

5.3  The  CPS  Transform  Process 

The  CPS  transform  is  a  technique  used  in  compilers  for  languages  supporting  first-class 
continuations.  If  the  implementation  is  stack  based,  capturing  a  continuation  may  require 
saving  the  whole  run  time  stack,  and  invoking  the  continuation  requires  restoring  the  stack. 
If  the  program  is  in  continuation  passing  style,  all  function  calls  are  tail  calls.  Invoking 
a  continuation  is  the  same  as  calling  a  function  because  neither  needs  to  return.  The 
run  time  stack  degenerates  to  an  activation  record  since  the  callee’s  activation  can  always 
overwrite  the  caller’s  activation. 

Pscheme’s  compiler  uses  an  extension  of  the  CPS  transform  to  characterize  ports  and 
parallelism.  We  will  discuss  the  CPS  transform  in  Pscheme’s  sequential  part  first,  and 
then  extend  it  to  the  whole  parallel  language. 

5.3.1  Transforming  Sequential  Expressions 

Continuation  passing  style  is  a  form  for  program  expressions  in  which  any  lambda  expres¬ 
sion  and  any  function  call  satisfy  the  following  property: 

1.  The  last  step  of  a  lambda  expression’s  body  must  be  a  function  call  expression 

2.  Any  function  call  expression  must  appear  at  the  tail  position  of  a  lambda  expression’s 
body 
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Before  the  CPS  transform,  if  in  an  expression,  a  function  call  does  not  appear  in  a  tail 
position,  whatever  follows  the  function  call  will  be  wrapped  into  another  function,  which 
we  call  the  function’s  continuation.  This  continuation  is  passed  as  a  parameter  of  the 
function  call,  and  will  be  invoked  later.  After  such  a  transform  process,  the  expression  is 
in  the  continuation  passing  style  (CPS).  In  a  CPS  expression,  each  function  takes  one  more 
argument,  the  continuation,  than  its  original  form.  All  continuation  functions  constructed 
during  the  transform  are  one-argument  functions. 

For  example,  the  factorial  function 

(define  (fac  n)  (if  (eq?  On)  1  (*  n  (fac  (-  n  1))))) 

can  be  transformed  into  continuation  passing  style  as  follows. 

(define  (fac  n  cont) 

(eq?  0  n  (lambda  (vl) 

(if  vl 

(cont  1) 

(-  n  1  (lambda  (v2) 

(fac  v2  (lambda  (v3) 

(*  n  v3  cont))))))))) 

This  new  factorial  function  satisfies  the  two  CPS  properties  stated  above.  When  fac 
is  called  at  the  top  level,  the  continuation  argument  cont  is  a  function  that  just  prints 
its  argument  and  terminates. 

The  CPS  transform  algorithm  for  sequential  expressions  is  shown  in  figure  5.1.  The 
transformation  function  CPS  takes  a  language  domain  expression  (in  bold  fonts  and 
Pscheme  syntax)  and  a  (meta-level)  continuation  function,  and  returns  a  transformed 
domain  expression.  The  continuation  function  can  be  reified  to  become  a  language  level 
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continuation  (a  one-argument  lambda  expression)  during  the  transformation.  The  begin 
expression  actually  represents,  in  general,  any  list  of  expressions  to  be  evaluated  sequen¬ 
tially  such  as  the  body  of  a  lambda  or  a  letrec. 

As  we  have  explained  in  the  intermediate  representation,  primitive  function  calls  are 
also  in  CPS  form,  but  no  continuation  for  a  primitive  operation  is  explicitly  constructed 
in  a  lambda  form.  Instead,  we  create  a  new  variable  bound  to  the  result  of  the  primitive 
operation,  and  perform  the  computation  following  it.  Therefore,  the  factorial  function, 
after  the  CPS  transform,  has  the  representation  shown  in  figure  5.2. 

The  transform  automatically  builds  a  continuation  function  representing  the  top  level 
(the  end  instruction)  so  that  when  it  is  invoked,  the  answer  is  printed  and  the  computation 
is  complete. 

Transforming  a  letrec  expression  needs  some  preprocessing,  and  will  be  discussed  in 
section  5.3.3. 

5.3.2  Ports  and  Parallel  Constructs 

The  CPS  transform  in  the  Pscheme  compiler  is  extended  to  deal  with  pcall,  throw, 
die,  call/sp,  call/mp  and  exclusive.  The  idea  is  to  enscapsulate  port  information  in 
continuations,  and  to  use  a  uniform  construct  para  to  specify  thread  creation.  (That  is, 
parallelism  is  only  generated  via  the  command  para.) 

In  the  compiler’s  internal  representation,  a  port  created  by  call/mp  or  call/sp  is 
an  asynchronous  port ,  which  contains  internally  a  FIFO  queue  for  objects  passed  in  and 
a  pointer  to  a  continuation  function  representing  the  rest  of  the  computation.  A  port 
created  by  pcall  is  a  synchronous  port ,  which  internally  also  contains  a  FIFO  queue,  a 
pointer  to  a  synchronizer ,  (which  is  the  function  whose  argument  position  is  associated 
with  the  port)  and  an  index  indicating  which  argument  position  of  the  synchronizer  the 
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CPS  :  E  X  (E  E)  E 

CPS(e,k )  =  A;(e)  if  e  G  Constants  U  Identifiers 
CPS(( if  el  e2  e3),k) 

=  CPS(el,Xv.(iiv  CPS(e2,k )  CPS(e3,k )  )  ) 

C'P5'((content  x),&)  =  (content  x  reify(k)) 

CPS(( asign  x  e),  k)  =  CPS(e ,  An.  (assign  x  n  reify(k)  )  ) 

C'P5((func  argl  arg2  ..  argn),A;) 

=  C'P5(func,  A/.CP5'(argl,  Ani.  •  •  •C'P5(argn,  Ann.(/  nx  •  •  -  n^,  reify(k )))  ■  ■  •) 

C'P5'((lambda  args  body),  A;) 

=  A;(  (lambda  append(args,list(k))  C'P5(body,  At>.(k  n))  )  ) 
where  k  is  a  new  identifier 

C'P5((begin  el  ..  en),A;) 

=  C'PS(el,  \Vl.  CPS( ■  --CPSi en,  A vn.(reify(k)  vn))  ■  •  •) 

C'P5((letrec  ((xl  el)  ..  (xn  en))  body),  A;) 

=  (letrec  ((xl  CPS(e.  1,  An.  (bind  xl  n)  )  )  ..  (xn  CPS(e n,  An.  (bind  xn  n)  )  )) 

CPS(body,k)  ) 


reify( An. exp)  =  (lambda(v)  exp[n/v]) 


Figure  5.1:  The  CPS  transform  algorithm  for  sequential  constructs 
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(define  (fac  n)  (if  (zero?  n)  1  (*  n  (fac  (-  n  1))))) 
(fac  3) 


LETREC 

FAC  =  LAMBDA  I  fl 
LETREC 

f 3  =  LAMBDA  v2 

APPLY  fl  v2 

PRIM  EQ?  ON  v5 
IF  v5 

THEN  APPLY  f3  1 
ELSE  LET 

f 13  =  LAMBDA  vl2 

PRIM  *  I  vl2  vl4 
APPLY  f 3  vl4 

PRIM  -  I  1  vll 
APPLY  FAC  vll  f 13 

f 17  =  LAMBDA  vl7 

END  vl7 

APPLY  FAC  3  f 17 


Figure  5.2:  The  CPS  transformed  Factorial  function 
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port  represents. 

Transforming  a  pcall  expression  creates  a  synchronizer  object,  which  specifies  inter¬ 
nally  the  pcalled  function,  its  arity  and  its  continuation  so  that  when  a  data  element  is 
received  at  each  of  its  synchronous  ports ,  the  pcalled  function  can  be  invoked  with  the 
continuation  argument  specified  in  the  synchronizer.  These  compiler-constructed  objects 

are  represented  by  the  following  instructions. 

<  singleport,  continuation  > 

<  multiport,  continuation  > 

<  buildport,  pcalled -function,  index  > 

<  sync,  pcalled -function,  arity,  continuation  > 

All  of  these  are  binding  expressions  that  only  occur  in  the  letrec  binding  in  a  trans¬ 
formed  expression.  The  CPS  transform  algorithm  for  parallel  constructs  is  shown  in 
figure  5.3,  We  give  short  examples  for  these  instructions  to  show  how  the  transformation 
works. 

Transforming  the  expression  (pcall  eO  el  .  .  en) 

The  compiler  transforms  eO ,  el,  ..  en,  and  generates  codes  that  evaluate  eO  first,  and 
then  el  to  en  in  parallel.  For  example,  the  following  pcall  expression 

(pcall  (lambda(x  y)  (+  x  y))  1  2) 

can  be  CPS  transformed  into: 

LETREC 

f 3  =  LAMBDA  X  Y  f2 

PRIM  +  X  Y  v4 
APPLY  f 2  v4 
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C'P5'((pcall  eO  el  ..  en),&) 

=  CPS(eO,  A/.(letrec  ((s  (make-synchronizer  f  reify(k )))) 

(para 

CPS(el,  An. (pass  v  s)) 

CPS(e n,  An. (pass  vs))))) 

C'P5'((call/mp  func),A;) 

=  C'P5(func,  A/.(letrec  ((p  (multiport  reify(k)))))  (/  p  reify(k )))  ) 
C'P5'((call/sp  func),A;) 

=  C'P5(func,  A/.(letrec  ((p  (singleport  reify(k)))))  (/  p  reify(k)))  ) 
CPS(( throw  el  e2),k ) 

=  CPS(el,  \p.CPS(e2,  Xv. (begin  (enq-port  v  p) 

(para  (deq-port  p) 

(reify(k)  v)))  )) 

C'P5((exclusive  func),A;) 

=  C'P5(func,  A/.(letrec  ((f’  (exclusive  /)))  (reify(k)  f’))  ) 
CPS((die),k)  =  (die) 


reify( An. exp)  =  (lambda(v)  exp[t>/v]) 

Figure  5.3:  The  CPS  transform  algorithm  for  parallel  constructs 
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f  7 


LAMBDA  v5 


END  v5 
si  =  SYNC  f 3  2  f 7 
pi  =  BUILD-PORT  si  1 
p2  =  BUILD-PORT  si  2 


PARA 

PASS  1  pi 
PASS  2  p2 

Since  a  two-argument  function  is  called,  two  synchronous  ports  and  a  synchronizer 
are  built.  Our  idea  is  to  keep  a  parallel  expression  in  CPS  form,  while  dealing  with  side 
effects  concerning  parallelism  and  synchronization  through  shared  data  structures. 


Transforming  throw,  call/sp  and  call/mp  expressions 

The  throw  expression  causes  left-to-right  evaluation  of  the  arguments  to  get  a  port  p  and 
a  value  v.  It  passes  v  to  p,  and  then  spawns  a  new  thread  which  dequeues  the  port  p  and 
performs  the  computation  encapsulated  in  the  port.  4 

In  the  next  example,  we  introduce  our  three  port  operations:  <  pass ,v,p  >, 
<  enq-port,  v,p  >,  and  <  deq-port,p  >.  <  enq-port,  v,p  >  puts  an  element  v  into 

the  queue  for  port  p.  <  deq-port,p  >  dequeues  the  queue  for  port  p  to  get  a  value  v, 
and  starts  the  computation  that  follows  if  no  blocking  occurs.  In  some  cases,  such  as  bar¬ 
rier  synchronization  or  exclusive  access,  where  the  computation  following  the  port  cannot 

4 A  port  contains  internally  a  pointer  to  a  continuation  function,  which  will  be  invoked  whenever  an 
element,  if  any,  is  dequeued  from  its  internal  queue.  Invoking  this  continuation  performs  the  computation 
that  is  called  encapsulated  in  (or  following )  the  port. 
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proceed,  the  deq-port  operation  will  not  dequeue  elements  from  the  queue.  Instead,  the 
elements  will  stay  in  the  queue  until  blocking  is  released.  This  preserves  the  order  in 
which  elements  are  sent  to  a  port,  and  is  important  in  stream  based  programming. 

The  pass  operation  generally  combines  enq-port  and  deq-port.  The  reason  to  de¬ 
compose  it  into  two  operations  is  for  the  implementation  of  throw.  As  shown  in  figure  5.3, 
to  throw  an  element  into  a  port,  we  have  to  enqueue  it  before  spawning  a  new  dequeuing 
thread.  This  guarantees  an  observationally  atomic  throw,  that  is,  the  values  thrown  by 
the  same  thread  to  the  same  port  will  arrive  in  the  order  in  which  they  are  thrown. 

The  call/mp  and  call/sp  expressions  cause  the  creation  of  asynchronous  ports  ex¬ 
plicitly.  Here  is  an  example  combining  call/sp  and  throw. 

The  expression  (call/sp  (lambda(p)  (throw  p  1)))  can  be  transformed  to: 

LETREC 

fl  =  LAMBDA  vl 

END  vl 

pi  =  SINGLEPORT  fl 
f 2  =  LAMBDA  v2 

PASS  v2  pi 
f 4  =  LAMBDA  P  f3 
LETREC 

f 5  =  LAMBDA  v4 

APPLY  f 3  v4 

ENQ-PORT  1  P 

PARA 

APPLY  f 5  1 
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DEQ-PORT  P 


APPLY  f 4  pi  f 2 

Transforming  the  exclusive  expressions 

The  exclusive  construct  is  not  implemented  at  the  compiler  level.  The  locking  and 
ordering  mechanism  depends  on  the  facilities  provided  by  the  underlying  platform,  so  it 
is  implemented  in  the  run  time  system.  The  compiler  simply  binds  a  new  identifier  for 
every  occurrence  of  the  exclusive  construct. 

For  example,  the  ((exclusive  (lambda(x)  x))  1)  will  be  transformed  into: 

LETREC 

f 3  =  LAMBDA  X  f2 
APPLY  f 2  X 
fl  =  EXCLUSIVE  f 3 
f 5  =  LAMBDA  v3 

END  v3 

APPLY  fl  1  f5 

5.3.3  Mutual  Recursion  and  letrec 

In  our  CPS  transform,  we  keep  the  letrec  expression  as  our  basic  binding  construct.  (The 
let  forms  have  been  translated  as  lambda  applications.)  letrec  is  used  to  represent 
mutually  recursive  functions  primarily.  Unfortunately,  unlike  other  languages  such  as 
ML,  Scheme’s  letrec  also  introduces  circular  defined  objects  because  the  bindings  are 
not  restricted  to  lambdas.  Therefore,  there  could  be  expressions  like: 
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(letrec  ((x  x))  x)  or  (letrec  ((xl  x2)  (x2  xl))  xl) 

They  are  well  defined  in  the  syntactic  letrec  definition,  but  will  return  unexpected 
objects  depending  on  implementation. 

Our  Pscheme  compiler,  however,  rejects  such  circular  definitions  among  objects  other 
than  lambda  expressions.  That  is,  mutual  recursion  applies  only  to  lambda  definitions. 
For  non-lambda  objects  defined  in  the  same  letrec,  there  can  only  be  non-circular  de¬ 
pendencies  among  them.  As  a  result,  we  can  reorder  all  non-lambda  bindings  so  that 
there  is  no  forward  reference  among  them. 

Recall  that  aside  from  data  structure  building  operations,  a  binding  expression  bexp 
can  also  be  a  CPS  expression  containing  none  of  these  control  constructs:  apply,  para, 
pass,  end  and  die.  If  no  control  construct  appears  in  a  letrec  binding  part,  the  CPS 
transform  is  simply  applied  to  each  of  the  binding  expressions.  If,  however,  the  ith  non¬ 
lambda  binding  contains  a  control  instruction,  we  perform  the  following  transform  before 
the  CPS  transform. 

(letrec  ((/j  lbd{)  ..  (/m  lbdm)  (a?i  ei)  ..  (ay  et)  ..  (xn  en)) 

body) 

(letrec  ((/i  /6di)  ..  (/m  lbdm) 

{x\  ei)  ..  (ay_i  e8_i)  (ay  nil)  ..  (xn  nil)) 

(set!  ay  et) 

(set!  xn  en ) 
body) 

The  assignment  conversion  is  needed  again  for  such  letrec  expressions  that  generate 
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new  assignments  during  the  transform.  At  last,  after  the  CPS  transform,  the  binding  of 
the  letrec  expression  will  contain  only  those  bexp  instructions  defined  in  section  5.1. 

Our  treatment  of  the  letrec  form  saves  unnecessary  indirection  caused  by  mutual 
recursion.  Keeping  the  constructs  for  defining  mutually  recursive  functions  allows  as 
many  functions  as  possible  to  share  the  same  closure  record,  as  long  as  we  define  them  in 
the  same  letrec.  This  is  what  the  next  compilation  stage,  closure  conversion,  is  based 


5.4  Free  Variable  Set  Computing  and  Hoisting 

As  we  have  mentioned  above,  a  compiled  Pscheme  program  is  a  big  letrec  construct  with 
many  mutually  recursive  functions  defined  at  the  tope  level.  This  requires  a  structural 
rearrangement  of  the  whole  letrec  expression.  The  principle  is  to  move  as  many  functions 
as  possible  into  the  same  letrec.  This  is  an  auxiliary  stage  called  hoisting  in  our  compiler. 

Hoisting  removes  unnecessary  nested  bindings,  and  is  performed  before  the  closure 
conversion  stage,  and  also  before  the  thread  abstraction  stage.  In  general,  transforming  a 
hoisted  expression  produces  less  redundancy  than  transforming  an  unhoisted  expression. 
On  the  other  hand,  since  the  transformation  is  a  recursive  traversal  of  the  expression 
tree,  it  usually  generates  some  redundant  nestings.  This  auxiliary  stage  helps  remove 
them.  Remember  that  after  the  scope  analysis  stage,  every  identifier  has  a  unique  name, 
so  generally  we  need  not  worry  about  the  scope  of  any  identifier  when  rearranging  an 
expression. 

The  two  basic  cases  in  which  nesting  can  be  removed  are: 

(letrec  def-listl  (letrec  def-list2  body)) 

— ►  (letrec  append(def-listl,def-list2)  body) 
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(letrec  ((mi  e{)  .  .  (ay_i  e,-_i) 

(ay  (letrec  ((^  e^)  .  .  (ym  e'm))  body,)) 

(ay+1  *4+1  )  •  •  (^ri  ^ri,  )  ) 

body) 

— ►  (letrec  ((®i  ei)..(ay_i  e4-_i) 

(yi  ei)  .  .  (a/m  e^) 

(ay  body,) 

(ay'-|- 1  *4+1  )  •  •  (^ri  ^ri)  ) 

body) 


There  are  cases,  however,  in  which  we  have  to  take  care  of  scopes.  That  is  when  new 
bindings  are  introduced  by  lambda  expressions.  Consider  the  expression: 

(letrec  def-listl  (lambda(x)  (letrec  def-list2  body))) 

We  can  transform  it  to  be 

(letrec  append(def-listl,def-list2)  (lambda(x)  body)) 
if  x  does  not  occur  free  in  def-list2.  The  principle  applies  to  the  second  basic  case,  too. 
In  general,  if  the  binding  part  of  a  letrec  expression  should  be  lifted  out  of  a  non-letrec 
binding  such  as  a  lambda,  primop,  content,  or  assign  expression,  a  scope  check  is 
necessary  to  see  if  the  new  bound  variable  occurs  free  in  the  lifted  part. 

For  example,  the  expression 


(letrec  ((fl  (lambda(xl)  (letrec  ((yl  xl))  yl))) 
(f2  (lambda(x2)  (letrec  ((y2  1))  2)))) 
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PPS  form  : 


LET 

FI  =  LAMBDA  XI  fl 
LET 

Y 1  =  BIND  Y 1  XI 
APPLY  fl  Y1 
F2  =  LAMBDA  X2  f3 
LET 

Y2  =  BIND  Y2  1 
APPLY  f 3  2 
PRIM  +12  v8 
LET 

Z  =  BIND  Z  3 
END  F2 


Hosted  form: 

LET 

Fl  =  LAMBDA  XI  fl 
LET 

Y 1  =  BIND  Y 1  XI 
APPLY  fl  Y1 
Y2  =  BIND  Y2  1 
F2  =  LAMBDA  X2  f3 
APPLY  f 3  2 
Z  =  BIND  Z  3 
PRIM  +12  v8 
END  F2 


Figure  5.4:  An  example  of  hoisting 
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(+  1  2) 

(letrec  ((z  3))  f2)) 

can  be  transformed  as  shown  in  figure  5.4.  The  letrec  in  f2’s  binding  can  be  lifted  but 
the  one  in  f  l’s  can  not  due  to  the  free  variable  xl. 

In  order  to  perform  hoisting,  we  have  to  compute  the  free  variable  set  for  each  expres¬ 
sion.  The  free  variable  set  computation  is  also  a  recursive  traversal  through  the  expression 
tree.  The  more  tedious  part  is  to  maintain  the  free  variable  set  during  hoisting  so  that  an 
expression  node  will  correctly  have  its  free  variable  information  although  it  is  rearranged. 

The  free  variable  set  is  also  essential  for  the  the  next  transform  stage  -  closure  con¬ 
version,  a  transformation  that  makes  functions  have  no  free  variables. 

5.5  Closure  Conversion 

A  CPS  expression  is  used  in  a  compiler’s  intermediate  representation  because  it  is  very 
close  to  the  instructions  of  a  Von  Neumann  machine.  Function  calls  are  just  control 
transfers.  However,  the  notion  of  functions  is  less  primitive  than  on  the  Von  Neumann 
machine  because  a  CPS  function  has  free  variables.  Therefore,  a  function  representation 
must  have  free  variable  information  aside  from  code  information.  Such  a  representation  is 
more  complicated  than  the  instructions  on  the  Von  Neumann  machine.  Looking  up  free 
variable  information  also  makes  the  function  application  more  expensive. 

One  technique  to  transform  functions  into  forms  in  which  no  function  has  free  variables 
is  the  closure  conversion.  5  Since  a  closure  contains  information  for  the  code  as  well  as 
the  free  variables,  if  it  can  be  passed  as  a  function  argument,  all  the  information  that  a 
function  needs  can  be  found  in  it,  thus  the  function  needs  no  reference  to  those  original 
free  variables.  In  other  words,  closure  conversion  makes  the  closure  representation  explicit 
5There  is  another  technique  called  lambda  lifting  proposed  by  Thomas  Johnson  [41]. 
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at  the  language  level.  Closure  construction  and  free  variable  lookups  become  language 
instructions,  too.  The  explicit  representation  of  closure  operations  also  makes  the  internal 
representation  more  machine  independent. 

Specifically,  the  idea  is  to  let  a  function  with  free  variables  take  one  more  argument, 
its  own  closure,  so  that  any  free  variable  can  be  referenced  through  the  closure  argument. 
Therefore,  the  function  is  closed.  The  caller  of  this  function,  on  the  other  hand,  has  to 
pass  an  extra  closure  argument.  The  closure  is  constructed  in  the  same  binding  block 
where  the  function  is  defined  so  as  to  be  visible  to  all  call  sites. 

Since  the  closure  is  distinguished  from  the  function,  whenever  there  is  a  function 
object  passed,  stored  or  returned,  we  pass,  store  or  return  the  closure  object.  Whenever 
a  function  is  called,  we  invoke  the  lambda  object. 

In  our  implementation,  closures  are  records  with  several  fields  of  code  pointers  followed 
by  several  fields  of  free  variable  values.  Functions  are  all  represented  uniformly  in  a  closure 
passing  style.  However,  explicitly  constructing  a  closure  for  each  function  is  too  expensive. 
Our  hoisting  stage  has  put  as  many  function  definitions  as  possible  in  the  same  letrec 
binding  block.  These  functions  defined  in  the  same  binding  block  have  the  same  lexical 
environment,  so  it  is  reasonable  to  let  them  share  a  closure  record.  The  free  variables  in 
the  shared  closure  record  are  the  union  of  all  sharing  functions’  free  variables,  and  the 
code  pointer  fields  are  naturally  pointing  to  these  sharing  functions’  codes.  Since  free 
variable  fields  follow  code  pointer  fields,  the  closure  for  each  function  can  be  expressed  by 
the  base  address  of  the  shared  record  and  an  offset. 

Figure  5.5  shows  a  simple  example  of  how  the  closure  conversion  transforms  the  input 
expression  (letrec  ((f  (lambda(x)  x))  (f  0))).  In  the  transformed  expression,  F 
and  f4  are  two  closures  sharing  the  same  record,  and  the  code  fields  point  to  lambda  F’ 
and  f4’  respectively.  When  F’  (that  is  F’  ’  ’)  is  called,  its  self-closure  F  is  passed  as  the 
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Input  expression: 


(letrec  ((f  (lambda(x)  x)))  (f  0)) 


Continuation  passing  style:  (hoisted) 
LETREC 

F  =  LAMBDA  X  fl 
APPLY  fl  X 
f 4  =  LAMBDA  v4 

END  v4 
APPLY  F  0  f4 


Closure  passing  style:  (hoisted) 

LETREC 

F’  =  LAMBDA  F”  X  fl 
LETREC 

fl’  =  SELECT  fl  0 
APPLY  fl’  fl  X 
f 4 ’  =  LAMBDA  f4”  v4 
END  v4 
F  =  CLOSURE  F’  f4’ 
f 4  =  OFFSET  F  1 
F ’ ’ ’  =  SELECT  F  0 
APPLY  F”’  F  0  f4 


Figure  5.5:  An  example  of  closure  conversion 
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first  argument  of  F’  (although  there  is  no  free  variable  in  the  original  F).  Again,  hoisting 
is  performed  after  closure  conversion  for  subsequent  processing. 

Closure  sharing  is  sometimes  restricted.  Functions  defined  in  the  same  letrec  binding 
cannot  shared  a  closure  record  if  it  results  in  circularity.  For  example,  the  input  expression 
in  figure  5.6  is  legal  because  there  is  no  circularity  among  non-lambda  objects.  But, 
closure  conversion  with  sharing  creates  circularity  among  closure  F  and  variable  X.  A 
correct  transformation  needs  a  dependency  check  to  partition  bindings  into  groups  for 
shared  closure  records.  In  the  second  transformation,  F  and  G  do  not  share  a  closure,  and 
no  circularity  is  created. 

5.6  Thread  Abstraction 

Our  execution  model  is  to  express  a  sequential  thread  with  a  series  of  CPS  function 
applications,  and  to  describe  parallelism  through  data  structures  (ports)  and  the  thread 
spawning  construct  para,  which  takes  a  list  of  CPS  expressions.  The  Pscheme  compiler 
abstracts  these  CPS  expressions  into  the  form  of  function  applications.  That  is,  using  one 
function  application  to  represent  a  series  of  function  applications. 

Thread  abstraction  is  the  last  stage  of  transformation.  It  transforms  argument  expres¬ 
sions  of  every  para  expression  into  a  function  application  form  so  that  it  is  easy  to  store 
it  into  data  structures  in  the  run  time  system. 

A  thread  can  be  abstracted  by  computing  its  free  variables,  and  creating  a  new  function 
taking  those  free  variables  as  arguments.  For  example,  the  expression 

PARA 

PASS  1  pi 
PASS  2  p2 

can  be  abstracted  to  be: 
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Input : 


(letrec  ((f  (lambdaQ  1)) 

(x  f) 

(g  ( lambda ()  x))) 

0) 

Transform  1:  closure  sharing  causes  circular  definition 
LETREC 

F’  =  LAMBDA  F’ ’  fl 
LETREC 

fl’  =  SELECT  fl  0 
APPLY  fl’  fl  1 
G’  =  LAMBDA  G’ ’  f3 
LETREC 

X’  =  SELECT  G’ ’  1 
f 3 ’  =  SELECT  f 3  0 
APPLY  f 3 ’  f 3  X’ 

F  =  CLOSURE  F’  G’  X 
G  =  OFFSET  F  1 
X  =  BIND  X  F 
END  0 

Transform  2:  closure  conversion  without  sharing 
LETREC 

F’  =  LAMBDA  F’ ’  fl 
LETREC 

fl’  =  SELECT  fl  0 
APPLY  fl’  fl  1 
F  =  CLOSURE  F’ 

X  =  BIND  X  F 
G’  =  LAMBDA  G’ ’  f3 
LETREC 

X’  =  SELECT  G’ ’  1 
f 3 ’  =  SELECT  f 3  0 
APPLY  f 3 ’  f 3  X’ 

G  =  CLOSURE  G’  X 
END  0 

Figure  5.6:  Closure  conversion  with  restricted  sharing 
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LETREC 


fl  =  LAMBDA  vl 

PASS  1  vl 
f 2  =  LAMBDA  v2 

PASS  2  v2 

PARA 

APPLY  fl  pi 
APPLY  f 2  p2 

Functions  representing  thread  abstractions  are  not  in  CPS  form.  Since  these  are  all 
known  functions  (the  compiler  knows  all  the  call  sites)  ,  we  need  not  worry  about  how  to 
call  them  (in  CPS  or  not).  Again,  hoisting  is  applied  after  thread  abstraction  to  remove 
nested  bindings. 

At  last,  we  have  our  intermediate  representation,  which  is  close  to  the  instruction  level 
of  a  Von  Neumann  machine.  All  function  definitions  are  lifted  to  the  top  level  because 
none  of  them  has  free  variables.  In  fact,  a  function  body  can  have  free  reference  to  other 
function  (lambda)  names,  but  those  are  considered  label  constants,  not  free  variables. 
The  code  generator  knows  all  these  definitions,  too,  and  will  have  no  problem  generating 
code  for  such  free  references. 

We  show  the  whole  transform  process  through  the  parallel  sum  example  in  appendix 

A. 

5.7  Heap-based  Code  Generation 

To  generate  C  code  that  are  as  portable  as  possible,  we  leave  the  implementation  of  ports 
and  all  the  parallel  control  primitives  to  the  run  time  system. 

The  execution  model  that  Pscheme  compiler  assumes  is  queue-based  multiprocessing 
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with  a  shared  heap.  That  is,  any  thread  newly  created  is  put  in  a  global  shared  task 
queue.  A  number  of  processors  keep  taking  thread  abstractions  from  the  task  queue  to 
execute.  Coordination  among  threads  are  through  ports.  Any  thread  returning  a  value 
to  the  top  level  signals  the  completion  of  the  whole  computation. 

Due  to  the  global  naming  space,  we  use  a  shared  heap  to  store  all  objects.  Since  C 
is  not  a  tail  recursive  language,  our  CPS  representation  is  translated  into  C  differently. 
Because  a  sequential  thread  is  composed  of  a  series  of  tail  calls,  that  is,  a  series  of  control 
transfers,  we  can  use  a  dispatching  function  to  transfer  control.  Calling  a  function  can  be 
translated  as  the  caller  returns  the  callee’s  code  address  to  the  dispatching  function,  and 
let  the  dispatching  function  call  it.  Thus,  we  have  the  same  effect  as  the  tail  recursive 
language. 

For  each  thread,  we  only  need  to  allocate  one  activation  frame  because  the  caller’s 
frame  can  always  be  overwritten  by  the  callee’s.  All  pointers  (for  instance,  in  a  cons-cell, 
port,  or  closure)  go  to  the  shared  heap,  so  overwriting  the  caller’s  frame  will  not  cause 
the  loss  of  information  about  any  object. 

The  heap  is  an  array  in  C.  The  frame  for  each  thread  works  as  a  set  of  virtual  registers 
on  a  processor.  The  size  need  not  be  large  for  every  CPS  function  is  light  weight  and 
contains  only  a  few  temporary  variables.  The  frame’s  actual  layout  may  depend  on  the 
platform.  Currently,  we  use  register  variables  in  C,  and  do  not  perform  real  register 
allocation. 

The  dispatching  function  interacts  with  the  compiled  functions  through  a  dispatching 
loop  so  as  to  achieve  the  same  effect  as  Pscheme  function’s  tail  calls.  A  simplified  version 
of  the  dispatching  loop  may  look  as  follows: 

while  (!Done) 

start  =  (function_ptr)  (*start)(); 


92 


Given  a  thread  abstraction,  we  can  initialize  start  to  be  our  starting  function  to  apply. 

The  intermediate  representation  of  a  transformed  Pscheme  program  is  a  big  letrec 
form  with  all  the  lambdas  defined  at  its  top  level  (which  is  good  for  C  does  not  support 
nested  defined  functions).  Before  code-generation,  the  compiler  computes  the  frame  size 
(that  is,  number  of  arguments  and  temporary  variables)  needed  for  each  lambda.  This  is 
the  necessary  information  for  the  garbage  collector  to  traverse  activation  frames. 

Generating  code  for  binding  a  variable  to  a  lambda  /  causes  the  definition  of  a  C 
function  /,  and  the  construction  of  a  lambda  object  containing  information  for  the  code 
pointer  to  /,  its  frame  size,  and  locking  information  (explained  in  the  next  chapter).  Gen¬ 
erating  code  for  binding  a  variable  to  a  non-lambda  object  simply  causes  the  construction 
of  the  object.  The  main  body  of  the  top  level  letrec  is  abstracted  into  a  function  with 
zero  arguments.  An  application  of  this  function  is  put  in  the  global  task  queue  initially 
so  that  the  run  time  system  can  start  the  whole  computation  with  this  initialized  task 
queue. 

All  parallel  primitives  are  translated  into  function  calls  to  the  run  time  system.  These 
operations,  as  well  as  Scheme’s  primitive  operations  are  the  only  function  call  instructions 
in  the  code  generated  by  the  compiler. 
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Chapter  6 

The  Run  Time  System 

6.1  The  Execution  Model  and  The  Current 
Platform 

Our  run  time  system  assumes  a  shared  memory  machine.  As  mentioned  in  last  chapter, 

the  execution  model  is  characterized  as  follows,  and  is  shown  in  figure  6.1. 

•  queue-based  multi-processing:  A  fixed  number  of  virtual  processors  keep  executing 
threads  in  a  global  shared  task  queue.  [21] 

•  A  thread  represents  a  series  of  tail  calls  to  compiled  Pscheme  functions. 

•  A  tail  call  transfers  control  back  to  the  dispatching  function,  and  the  callee  is  then 
dispatched. 

•  Thread  coordination  is  achieved  through  shared  ports  and  the  shared  task  queue. 

•  All  the  shared  objects  are  in  the  global  shared  heap,  and  the  heap  can  be  garbage 
collected. 

The  run  time  system  contains 


•  the  implementation  of  all  parallel  primitives, 


Activation  Frames 


Virtual  Processors 


Shared  Space 
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•  an  ordering  mechanism, 


•  a  garbage  collector  for  the  shared  heap, 

•  a  function  dispatcher  for  sequential  threads, 

•  a  resource  manager  providing  semaphores,  queues,  locks  etc.,  and 

•  the  implementation  of  all  Scheme’s  primitive  functions. 

It  is  currently  running  on  NYU  Ultra  II  computer,  which  is  a  shared  memory  8-processor 
machine  [26,57].  Processors  are  8M  Hz  MC68010  chips.  The  machine  has  a  32KB  cache 
and  4MB  user  memory.  When  the  16-processor  64  MB  Ultra  III  is  available,  we  will  be 
able  to  port  our  system  onto  it. 

The  key  element  of  the  Ultra  computer  is  the  hardware  primitive  fetch-and-add  op¬ 
eration,  which  atomically  adds  an  integer  v  to  an  integer  variable  n  in  the  memory,  and 
returns  the  old  value  of  n.  Since  fetch-and-add  is  implemented  with  non-blocking  parallel 
access  to  a  shared  word,  it  is  naturally  used  to  implement  locks,  semaphores,  synchronizing 
barriers  or  parallel  FIFO  queues.  [27] 

Provided  with  these  mechanisms,  it  is  easy  for  our  run  time  system  to  provide  a 
shared  heap  with  parallel  access  because  the  heap  top  pointer  can  be  fetch- and- add ’ed  in 
parallel.  Ports  also  support  parallel  access  due  to  parallel  FIFO  queues,  and  together  with 
locks  and  semaphores,  they  help  implement  an  ordering  mechanism,  which  is  essential  for 
exclusive  functions  and  stream  based  programming.  Also,  of  course,  the  global  task  queue 
is  accessed  in  parallel.  Synchronizing  barriers  are  used  in  our  garbage  collection  protocol 
to  stop  virtual  processors  and  to  reactivate  them. 
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6.2  Implementation  of  The  Run  Time  System 


Before  starting  to  execute  the  compiled  program,  the  run  time  system  will  spawn  a  fixed 
number  of  virtual  processors  (no  more  than  the  number  of  real  processors  -  8),  which  are 
unix  processes  executing  the  same  code.  All  of  them  are  concurrently  reading  the  global 
task  queue  and  trying  to  take  thread  abstractions  off  the  queue  to  execute.  The  global 
task  queue  has  only  one  thread  abstraction  in  the  beginning.  When  a  thread  abstraction 
is  taken  by  a  virtual  processor,  it  is  handed  to  a  dispatching  function,  and  a  series  of 
function  calls  follows  as  it  enters  the  dispatching  loop.  Those  virtual  processors  having 
no  threads  to  execute  will  keep  reading  the  global  task  queue  until  it  gets  one  to  execute. 

Whenever  a  virtual  processor  executes  a  para  command,  one  or  more  new  thread 
abstractions  will  be  put  into  the  task  queue.  Whenever  one  executes  a  die  command, 
or  any  parallel  primitive  causing  it  to  idle  itself,  it  performs  a  context  switch ,  meaning  it 
accesses  back  to  the  task  queue  to  take  the  next  thread  available  to  execute. 

During  the  execution  of  a  sequential  thread,  there  can  be  primitive  function  calls. 
Recall  that  in  the  intermediate  representation,  the  primitive  function  call  is  in  a  semi-CPS 
form.  That  is,  no  continuation  argument  is  explicitly  passed  as  its  argument.  Instead,  the 
result  is  bound  to  a  new  variable,  and  the  computation  proceeds  with  this  new  binding. 
Therefore,  all  primitive  function  calls  are  directly  implemented  in  terms  of  C  function 
calls  without  going  through  the  dispatcher. 

Dynamic  type  checking  is  performed  at  each  entrance  to  primitive  functions.  Symbolic 
primitive  operations  are  mostly  heap  based  pointer  manipulation.  Numerical  primitive 
operations  are  not  implemented  according  to  Scheme’s  number  system.  Instead,  we  use 
C’s  number  system. 

Every  object  has  a  4-bit  type  tag.  The  types  are  CHAR,  BOOLEAN,  QUOTED _SBL,  INTEGER, 
REAL,  CELL,  C0NS_CELL,  LAMBDA,  CLOSURE,  STRING,  VECTOR,  SYNC_P0RT,  ASYNC_PORT,  and 
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SYNCHRONIZER.  The  constant  nil  is  of  type  BOOLEAN,  and  has  the  value  FALSE.  Ports 
captured  at  the  language  level  are  of  type  ASYNC_PORT,  and  ports  constructed  implicitly 
by  pcall  are  of  type  SYNC_P0RT.  String  constants  are  not  stored  in  the  heap.  Instead,  they 
are  in  an  auxiliary  shared  memory  which  will  be  described  in  the  next  section.  Detailed 
internal  representation  of  objects  can  be  found  in  Appendix  B. 

6.3  Storage  Management 

In  our  run  time  system,  each  virtual  processor  executes  a  sequential  thread  that  is  com¬ 
posed  of  a  series  of  tail  calls.  The  local  space  it  needs  is  the  current  activation  frame, 
which  is  implemented  as  a  local  object  array.  Calling  a  function  causes  the  loading  of  ar¬ 
guments  onto  the  activation  frame.  Temporary  variables  in  a  function  are  also  allocated 
in  the  frame.  Heap  allocation  is  performed  whenever  a  pointer  is  needed  to  construct 
an  object.  All  pointers  point  to  the  heap  so  that  no  object  is  lost  when  a  new  frame 
overwrites  the  old  one,  and  every  pointer  can  be  shared. 

For  example,  constructing  a  closure  allocates  heap  space  for  the  shared  record  while 
the  closure  object  containing  the  record  pointer  may  be  stored  in  current  activation  frame. 
Similarly,  a  cons-cell  can  be  stored  in  the  frame  with  its  car  and  cdr  fields  pointing  to  the 
heap. 

The  heap  stores  only  shared  language-level  objects.  There  is  other  shared  space  avail¬ 
able  for  data  elements  invisible  at  the  intermediate  representation  level  such  as  shared 
queues,  shared  queue  elements,  locks  and  semaphores  etc.  We  use  another  piece  of  preal¬ 
located  shared  space  for  them,  and  there  also  can  be  pointers  to  them  from  the  frame  or 
the  heap.  However,  this  piece  of  shared  space  cannot  be  garbage  collected  because  inter¬ 
nal  pointers  of  library-provided  constructs  (queues,  locks  and  so  on)  are  not  accessible. 
As  mentioned  in  last  section,  string  constants  are  also  stored  in  this  shared  space. 
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6.4  Garbage  Collection  of  The  Shared  Heap 


The  global  shared  heap  is  composed  of  two  halves.  One  is  in  use  (called  the  from- space). 
When  it  is  exhausted,  garbage  collection  is  initiated,  and  live  objects  are  to  be  copied  to 
the  other  half  (called  the  to-space).  Cheney’s  copying  collection  algorithm  [11]  is  among 
the  simplest  to  implement.  It  is  essentially  a  breadth-hrst  traversal  that  copies  objects 
from  the  from-space  to  the  to-space.  An  object  that  has  been  visited  contains  a  forwarding 
pointer  that  redirects  any  re-visits.  The  traversal  starts  from  the  activation  frames  of  all 
threads  and  all  thread  abstractions  in  the  global  task  queue. 

Since  there  are  more  than  one  virtual  processors  allocating  space  from  the  shared  heap, 
we  need  a  protocol  to  stop  all  virtual  processors  and  let  one  of  them  perform  garbage  col¬ 
lection  when  the  heap  is  exhausted. 

GC  polling  protocol: 

A  virtual  processor  stops  for  GC  only  when  it  fails  to  allocate  enough  space  from  the  heap 
or  it  sees  the  GC  request  from  another  when  context  switching. 


The  GC  polling  happens  only  when  a  virtual  processor  performs  context  switching, 
that  is,  it  has  completed  (or  cannot  go  on  further)  the  current  thread  and  ready  to  take 
another  from  the  task  queue.  When  a  virtual  processor  makes  a  GC  request,  it  checks  if 
it  is  the  first  one  requesting  GC.  If  so,  it  waits  for  every  one  to  stop  through  a  barrier.  If 
not,  it  simply  stops  at  the  barrier.  If  a  virtual  processor  is  not  polled  or  needs  no  heap 
allocation,  it  simply  keeps  running. 


100 


6.5  The  Ordering  Mechanism 

Order  preservation  is  an  essential  issue  in  Pscheme  because  ports  are  viewed  as  FIFO 
constructs,  and  different  threads  executing  the  same  code  need  to  be  ordered  through  the 
use  of  ports  and  exclusive  functions.  This  is  often  seen  in  stream-based  programming 
examples.  A  stream  of  data  elements  may  go  through  several  processing  stages.  The 
output  stream  has  to  preserve  the  same  order  as  the  input  stream. 

Ordering  is  guaranteed  in  the  run  time  system  by  locking  and  the  use  of  ports.  Ports 
are  implemented  with  FIFO  queues.  An  element  can  be  dequeued  from  the  port  (syn¬ 
chronous  or  asynchronous)  only  when  it  will  not  out  race  previous  dequeued  elements. 
For  example,  if  an  asynchronous  port  p  is  connected 1  to  a  synchronous  port  q,  in  order  to 
dequeue  p  and  then  enqueue  q,  we  have  to  block  subsequent  dequeues  until  the  enqueuing 
of  q  is  done.  This  happens  in  the  case  of  a  pcall  expression  when  the  arguments’  current 
ports  are  captured.  For  example,  in  the  expression 

(pcall  f 

(call/mp  (lambda  (pi)  ...)) 

(call/mp  (lambda  (p2)  ...))) 

pi  is  a  captured  asynchronous  port,  which  may  keep  receiving  values,  but  in  the  imple¬ 
mentation,  a  synchronous  port  pi’  is  also  built  because  of  pcall.  A  value  v  dequeued 
from  pi  is  passed  to  a  continuation  k,  which  enqueues  v  into  pi’.  If  we  make  the  in¬ 
vocation  of  k  exclusive  2,  we  are  guaranteed  that  values  thrown  to  pi  repetitively  will 
be  enqueued  to  pi’  in  the  same  order.  In  this  case,  we  say  that  pi  is  connected  to  pi’ 
through  the  continuation  k. 

To  be  explained  later 

2  Actually,  k’s  access  lock  is  examined  before  v  is  dequeued  from  pi) 
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At  the  language  level,  programmers  usually  enforce  the  order  through  exclusive  func¬ 
tions.  For  example,  a  filtering  function  for  a  data  stream  must  be  exclusive  so  that  data 
elements  can  be  explicitly  thrown  to  the  port  leading  to  that  function.  When  we  are 
to  dequeue  the  port  and  then  call  the  exclusive  function,  we  have  to  get  access  to  the 
exclusive  function  (and  thus  lock  it)  first  before  any  element  is  taken  off  the  queue.  This 
is  because  once  data  elements  are  dequeued,  their  order  is  lost.  Both  asynchronous  ports 
and  synchronous  ports  can  lead  to  exclusive  functions.  This  principle  is  obeyed  in  any 
case  in  our  implementation  of  port  related  primitives  such  as  dequeue  and  pass. 

6.5.1  Implementation  of  Exclusive  Functions 

Exclusive  functions  are  not  implemented  in  the  compiler.  In  the  intermediate  representa¬ 
tion,  the  instruction  <exclusive,  /  >  constructs  a  lambda  object  /',  which  is  the  same  as 
the  lambda  object  /  except  that  its  access  is  exclusive.  An  exclusive  lambda  can  appear 
in  the  held  of  any  closure  record.  An  exclusive  closure  is  just  a  closure  with  an  exclusive 
lambda  as  its  held  0. 

Our  internal  representation  of  a  lambda  has  a  held  containing  a  pointer  to  a  semaphore. 
If  this  pointer  is  not  null,  the  lambda  is  exclusive,  and  calling  the  lambda  requires  the 
P  operation  on  the  semaphore  hrst.  If  this  pointer  is  null,  the  lambda  is  non-exclusive. 

Constructing  an  exclusive  lambda  does  not  build  a  queue  to  store  calling  requests. 
Threads  contending  for  a  call  to  an  exclusive  lambda  will  be  busy-waiting  at  the 
semaphore.  For  a  port  p  leading  to  an  exclusive  function  /,  repetitively  throwing  val¬ 
ues  to  p  has  the  same  effect  as  storing  ordered  calling  requests  to  /.  Therefore,  we  do  not 
need  another  queue  for  /  to  store  calling  requests. 

Releasing  locks  on  exclusive  lambdas  is  a  little  more  complicated  because  our  program 
is  in  CPS  form,  and  there  is  no  return  operation  in  CPS  form.  But,  since  every  function 
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except  continuations  takes  a  continuation  argument,  we  know  that  when  that  continua¬ 
tion  is  invoked,  the  original  function  returns.  Therefore,  whenever  an  exclusive  lambda 
is  called,  we  store  its  semaphore  in  its  continuation  argument.  But  the  continuation  ar¬ 
gument  is  a  closure,  so  we  actually  store  the  semaphore  in  the  closure’s  held  0,  the  real 
continuation  function  (a  lambda  object).  Later  on,  when  that  continuation  is  invoked, 
we  perform  the  V  operation  on  that  semaphore  so  that  subsequent  requests  to  call  the 
exclusive  lambda  can  be  satisfied. 

This  approach  works  under  the  assumption  that  no  continuation  is  exclusive  so  that 
the  semaphore  held  of  a  lambda  is  not  overloaded.  Since  continuations  are  all  constructed 
by  the  compiler,  they  will  not  be  exclusive.  3 

In  addition  to  continuations,  semaphores  for  exclusive  lambdas  have  to  be  stored  in 
each  virtual  processor  as  well  because  a  die  in  the  body  of  an  exclusive  lambda  also 
releases  the  lock.  As  seen  in  our  operational  semantics,  a  die  causes  a  trace  all  the  way 
to  the  top  level  to  release  the  locks  for  all  nested  exclusive  function  calls  in  the  current 
thread.  In  the  run  time  system,  each  virtual  processor  control  block  contains  a  stack 
of  semaphores.  Calling  an  exclusive  lambda  pushes  its  semaphore  onto  the  stack.,  and 
releasing  the  lock  for  an  exclusive  lambda  causes  a  pop  on  that  stack.  Clearly,  a  die 
empties  the  stack. 


3  except  when  a  continuation  is  used  to  connect  two  ports,  it  has  to  be  exclusive  to  guarantee  elements 
enqueued  to  the  second  port  have  the  same  order  as  they  are  dequeued  from  the  first  port.  (Please  refer  to 
last  section  for  more  explanation.)  But,  such  a  continuation  function  cannot  possibly  be  the  continuation 
argument  of  any  user-defined  exclusive  lambda,  so  it  does  not  violate  our  assumption. 
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Chapter  7 


Performance 


In  this  chapter,  we  show  the  performances  of  several  sample  programs.  We  will  see  how 
parallel  Pscheme  programs  scale  up  in  our  system.  We  will  also  compare  the  performance 
of  sequential  Pscheme  programs  on  Sun  work  stations  with  their  performances  on  our 
current  run  time  system. 

7.1  Sample  Parallel  Programs 

To  observe  the  speedup  of  parallel  programs  in  our  implementation,  we  use  the  four 
samples: 

•  Parallel  summation  of  integers  from  1  to  n. 

•  Quicksort  a  list  of  numbers 

•  The  N-queens  problem 

•  Computing  Fibonacci  numbers 

The  source  codes  of  these  four  samples  are  provided  in  figure  7.1  and  figure  7.1.  The 
N-queens  problem  is  to  calculate  the  number  of  solutions  for  putting  n  queens  on  a  n  by 


;  parallel  sum 


(define  (sum  m  n) 

(cond  ((eq?  m  n)  m) 

((eq?  n  (+  1  m))  (+  m  n)) 

(else  (let  ((x  (/  (+  m  n)  2))) 

(pcall  +  (sum  m  x)  (sum  (+  1  x)  n)))))) 


(define  (fib  n)  ;  Fibonacci  number 

(if  (<=  n  3) 
n 

(pcall  +  (fib  (-  n  2))  (fib  (-  n  1))))) 


(define  (qs  num-list)  ;  quicksort 

(cond  ((null?  num-list)  nil) 

((null?  (cdr  num-list))  num-list) 

(else 

(let  ((pivot  (car  num-list))) 

(let  ((1-r  (partition  pivot  (cdr  num-list)  nil  nil))) 
(pcall  append 

(qs  (car  1-r)) 

(cons  pivot  (qs  (cdr  1-r))))))))) 

(define  (partition  p  1  1-buf  r-buf) 

(if  (null?  1) 

(cons  1-buf  r-buf) 

(let  ((x  (car  1))) 

(if  «  x  p) 

(partition  p  (cdr  1)  (cons  x  1-buf)  r-buf) 
(partition  p  (cdr  1)  1-buf  (cons  x  r-buf)))))) 

(qs  (create-list  200)) 


Figure  7.1:  Three  sample  programs  for  parallel  performance 
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(define  (queen  n)  (find-all  n  nil  0)) 

(define  (find-all  n  board  len) 

(if  (eq?  n  len) 

1 

(letrec 

((loop  (lambda(v) 

(if  (safe  board  v  1) 

(pcall  + 

(find-all  n  (cons  v  board)  (+  1  len)) 
(if  (<  v  n)  (loop  (+  v  1))  0)) 

(if  (<  v  n)  (loop  (+  v  1))  0))))) 

(loop  1)))) 

(define  (safe  board  elt  distance) 

(if  (null?  board) 

#t 

(let  ((last  (car  board))) 

(if  (or  (eq?  elt  last) 

(eq?  elt  (+  last  distance)) 

(eq?  elt  (-  last  distance))) 

#f 

(safe  (cdr  board)  elt  (+  distance  1)))))) 

(queen  6) 


Figure  7.2:  Sample  program  for  the  N-queens  problem 
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n  chess  board  so  that  they  cannot  attack  one  another.  Computing  Fibonacci  numbers  is 
via  the  parallel  version  of  the  recursive  program  with  exponential  complexity. 

We  vary  the  number  of  pre-spawned  virtual  processors  to  see  how  these  programs 
speed  up.  Due  to  the  memory  limitation  (4MB  user  space),  we  are  unable  to  perform 
experiments  with  large  amount  of  parallelism  (that  is,  large  input)  .  Here  we  provide  the 
results  of  running  these  four  programs  with  small  parameters  in  figure  7.3. 

Each  of  these  programs  has  a  very  fine  granularity.  This  affects  the  absolute  per¬ 
formance  because  we  only  have  a  fixed  number  of  virtual  processors  to  execute  a  large 
number  of  threads  in  the  global  task  queue,  and  no  dynamic  control  of  parallelism  is 
provided.  On  the  other  hand,  the  speedup  is  good  for  programs  with  enough  parallelism. 
The  parallel  summation  program  and  the  computation  of  Fibonacci  numbers  scale  up  well 
because  of  large  and  stable  parallelism. 

The  quicksort  program  scales  up  the  worst  because  there  might  be  very  uneven  parti¬ 
tioning  of  a  random  list,  and  not  enough  parallelism  may  be  generated.  As  the  number  of 
virtual  processors  increases,  memory  and  bus  contention  increases.  If  there  is  not  enough 
parallelism,  the  performance  may  scale  poorly.  The  6-queen  problem  has  a  better  speedup 
than  quick  sort,  but  worse  than  parallel  sum  and  Fibonacci.  The  speedup  curve  for  these 
four  samples  is  shown  in  figure  7.4.  We  expect  better  scalability  for  the  quick  sort  and 
N-queens  problem  when  the  input  parameter  is  larger. 

Two  important  factors  of  parallel  programs’  performance  are  the  degree  of  parallelism 
and  locality.  Pscheme  is  a  shared  memory  language  (one  requiring  no  explicit  copying 
of  data  in  order  to  make  it  accessible  to  some  part  of  the  program),  and  it  provides  no 
mechanism  to  control  excessive  parallelism  dynamically.  Therefore,  the  implementation 
had  better  be  able  to  take  care  of  such  optimizations. 
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Figure  7.3:  Performance  of 
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Our  current  compiler  generates  C  code.  All  virtual  registers  and  activation  frames  have 
the  same  access  cost  as  the  shared  global  heap  unless  the  C  compiler  can  successfully  cache 
these  hot  spots.  If  the  compiler  generates  assembly  code  and  allocates  machine  registers 
directly,  the  locality  can  be  improved. 

Since  programs  in  applicative  languages  are  highly  recursive,  it  is  easy  to  create  a  lot 
of  parallel  tasks  which  consume  resources  quickly.  To  control  excessive  parallelism,  some 
languages  allow  users  to  query  the  system  load  to  decide  whether  to  generate  more  tasks. 
For  example,  Qlisp  [25]  users  can  check  if  the  global  task  queue  is  empty.  Pscheme  does 
not  provide  any  such  meta-level  information  at  the  language  level.  One  way  to  solve  this 
problem  in  the  run  time  system  is  to  apply  an  unfair  scheduling  policy  as  in  Multilisp 
implementation  [29,43,44].  That  is,  a  newly  created  task  is  put  in  a  local  LIFO  task 
queue  for  the  creating  processor.  A  processor  with  an  empty  task  queue  can  steal  tasks 
from  other  processor’s  task  queue.  The  unfair  scheduler  simulates  the  sequential  LIFO 
behavior  to  prevent  combinatorial  explosion  of  parallelism  in  a  recursive  program.  At  the 
same  time,  better  locality  can  be  achieved.  Our  run  time  system  has  not  suffered  from  the 
problem  of  excessive  parallelism  when  input  is  not  too  large  in  the  experiments.  When 
the  memory  limitation  is  relieved,  we  may  modify  the  global  task  queue  model  to  apply 
the  unfair  scheduling  policy  in  order  to  experiment  with  large  inputs. 

7.2  Stream  Based  Programs 

In  chapter  3,  we  demonstrated  two  examples  of  stream  based  programming  -  quicksort 
and  merge  sort.  The  source  code  can  be  found  in  figure  3.13  and  figure  3.14.  These  two 
examples  show  another  dimension  (vertical)  of  parallelism,  and  are  expected  to  have  a  bet¬ 
ter  performance  speedup  than  programs  exploiting  only  horizontal  parallelism.  Figure  7.5 
shows  the  results  of  running  these  two  programs  with  small  inputs. 
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Figure  7.5:  Performance  for  two  stream  based  programs 
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Stream  based  programming  is  very  resource  consuming  because  data  elements  in  a 
stream  are  implemented  as  light  weight  tasks.  On  a  machine  with  a  small  number  of 
processors  or  not  supporting  massive  parallelism,  the  absolute  performance  suffers  from 
insufficient  resource  or  excessive  parallelism.  However,  there  will  always  be  enough  par¬ 
allelism  to  keep  all  processors  busy,  so  these  programs  scale  up  better.  As  we  can  see 
in  figure  7.4,  stream  based  quicksort  has  a  larger  speedup  factor  than  normal  parallel 
quicksort  (although  contention  still  affects  it  much  when  more  processors  provided),  and 
merge  sort’s  speedup  is  even  larger.  One  reason  that  the  speedup  is  not  as  satisfactory  as 
for  psum  or  fib  is  that  processors  spend  more  time  busy  waiting  due  to  the  heavy  use  of 
exclusive  functions  (though  busy  waiting  implementation  of  exclusive  functions  decreases 
context  switching  and  improves  absolute  performance). 

The  experiment  tells  that  more  non-trivial  parallelism  can  be  obtained  via  the  use  of 
streams.  It  also  tells  us,  however,  that  stream  based  programming  is  more  feasible  when 
the  language  is  implemented  on  a  machine  supporting  massive  parallelism.  In  any  case, 
users  are  provided  with  the  possibility  to  explore  more  useful  parallelism  in  Pscheme. 

7.3  Sequential  Programs 

In  a  practical  computation,  the  absolute  performance  is  as  important  as  the  scalability. 
We  would  like  to  observe  the  execution  of  sequential  programs  to  see  how  efficient  the 
codes  that  Pscheme  compiler  generates  can  be.  To  exclude  any  slow  down  factor  in  our 
run  time  system,  we  run  the  generated  codes  on  top  of  a  sequential  run  time  system, 
which  contains  no  global  task  queue  or  virtual  processors,  but  only  a  dispatching  loop. 

Experiments  are  performed  on  a  Sun3  (MC68020  CPU,  15M  Hz  clock)  and  a  Sun4 
(SPARC,  20M  Hz  clock)  architecture  respectively.  Sequential  sample  programs  are  still 
the  summation  and  Fibonacci.  The  Pscheme  codes  are  in  figure  7.6. 
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Table  7.1:  Execution  time  (ms)  of  sequential  programs  on  different  platforms 


sum 

hb 

10000 

20000 

30000 

20 

21 

22 

Pscheme 

1754 

3590 

5260 

1492 

2413 

3908 

SCM 

766 

1633 

2216 

417 

716 

1050 

Table  7.2:  Execution  time  (ms)  of  sequential  programs  through  different  implementations 


114 


We  compare  the  performance  with  that  of  our  run  time  system  on  Ultra  computer  (8M 
Hz  MC68010  CPUs).  The  generated  C  codes  for  these  samples  are  compiled  through  the 
same  compiler:  cc.  From  table  7.1,  we  see  that  sequential  programs  can  have  a  slow  down 
factor  about  1.7  to  2  at  worst.  This  explains  how  much  the  memory  or  bus  contention 
on  Ultra  computer  can  affect  the  performance.  As  we  have  seen,  the  sequential  programs 
run  faster  on  faster  platforms.  But,  how  efficient  are  these  sequential  codes  generated 
by  the  Pscheme  compiler?  We  make  a  comparison  here  with  SCM  [40]  in  table  7.2  by 
running  the  same  Scheme  programs  on  the  same  platform  (Sun3).  It  shows  that  the  codes 
generated  by  the  Pscheme-to-C  translator  and  the  C  compiler  will  be  two  to  four  times 
slower. 

The  experiment  tells  us  how  much  the  absolute  performance  is  affected  by  the  con¬ 
tention  on  a  parallel  machine,  and  how  much  inefficiency  is  caused  by  indirect  code  gener¬ 
ation  through  C  for  the  purpose  of  portability.  This  means  that  the  continuation-passing, 
closure-passing  transform  approach  in  the  Pscheme  compiler  can  produce  intermediate 
representation  that  is  efficient  enough  for  the  code  generator.  We  expect  competitive 
performance  if  machine  codes  can  be  generated  directly. 

In  conclusion,  Pscheme  programs  will  have  a  good  performance  if 

•  the  underlying  machine  supports  massive  parallelism, 

•  improved  scheduling  policy  is  provided  in  the  run  time  system  to  control  excessive 
parallelism,  or 

•  efficient  machine  codes  can  be  generated  directly  to  better  utilize  machine  registers 
and  the  cache  memory. 

While  our  goal  is  to  add  more  expressive  power  in  a  parallel  language,  we  have  shown 
that  Pscheme  can  also  have  satisfactory  performance.  Although  experiments  are  per- 
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formed  only  on  small  inputs,  we  are  able  to  improve  the  run  time  system  in  larger  scaled 
machines  to  get  a  better  speedup.  This  explains  the  feasibility  of  representing  general 
control  in  a  parallel  applicative  language.  Therefore,  programmers  are  provided  with 
more  expressive  power  without  losing  much  performance. 
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Chapter  8 


Related  Work 


Various  parallel  programming  languages  have  been  proposed.  In  this  chapter,  we  will  make 
comparisons  among  them  based  on  language  design  issues.  The  work  of  designing  Pscheme 
has  benefited  a  lot  from  the  concepts  in  these  languages  including  two  parallel  Lisps 
Multilisp  and  Qlisp,  the  Actors  model,  Lucid,  and  proposals  for  parallel  continuations. 

8.1  Multilisp  and  Futures 

Most  parallel  Lisps  are  shared  memory  languages  with  side  effects  because,  with  explicit 
parallelism  provided,  it  is  easier  to  model  shared  objects  containing  mutable  states  with 
side  effects,  and  easier  to  express  finer  grained  parallelism  without  worrying  about  copying 
shared  data  from  one  domain  to  another.  Such  a  design  decision  also  makes  the  language 
closest  to  the  sequential  Lisp  counterpart,  and  hence  is  adopted  by  Multilisp,  Qlisp  and 
Pscheme.  On  the  other  hand,  the  way  of  introducing  parallelism  varies,  and  leads  to 
different  programming  styles. 

The  concept  of  pcall  is  natural  in  parallel  functional  programming.  Multilisp  has 
another  primary  approach  to  parallelism  -  futures  1.  Futures  enable  programmers  to 


JThe  idea  of  futures  had  been  around  for  a  while  when  Multilisp  was  proposed,  but  Multilisp’s  future 


express  parallelism  at  different  levels  of  granularity.  It  is  a  more  fundamental  construct 
than  pcall  in  that  it  characterizes  more  parallel  behaviors  and  in  fact,  it  can  implement 
pcall.  On  the  other  hand,  it  is  high  level  enough  to  save  programmers  concerns  of 
synchronization. 

Since  a  future  process  commits  the  result  it  has  computed  2  ,  a  future  value  usually 
serves  as  a  constant  once  it  becomes  available.  When  a  language  supports  continuations, 
this  becomes  a  problem.  The  code  for  the  future  process  can  be  re-entered  through  the 
invocation  of  continuations  captured  inside.  The  interaction  between  continuations  and 
futures  makes  the  semantics  for  either  construct  hard  to  understand.  It  is  suggested  that 
the  meaning  of  continuations  are  extended  in  the  spirit  that  any  mixed  use  of  futures  and 
continuations  expects  to  have  the  same  result  as  the  sequential  program  without  using 
futures.  However,  continuations,  based  on  concepts  of  sequential  computation,  cause 
problems  although  there  have  been  proposals  to  improve  the  interaction  between  futures 
and  continuations  [42].  That  is  why  we  feel  the  need  to  provide  an  alternative  parallel 
extension  for  continuations. 

Ports  do  provide  a  lower  level  construct  than  the  future,  and  can  implement  more 
other  parallel  constructs.  The  future  implemented  with  ports  as  shown  in  figure  3.7 
restricts  multiple  returned  results.  This  is  guaranteed  by  a  single-port  captured  at  the 
future  constructing  point  so  that  the  first  result  that  the  future  computes  is  committed 
as  its  value. 

Exclusiveness  is  another  concern  of  parallel  languages.  Multilisp  has  two  low-level 
primitives  replace  and  replace-if-eq  supported  by  the  Concert  platform  [3,31],  a 
32-processor  multiprocessor.  They  are  as  low  level  as  test-and-set  and  Ultra  com- 

is  the  first  implementation  on  real  parallel  machines. 

2  Otherwise,  there  could  be  problems  such  as  top  level  parallelism  as  mentioned  in  [42]. 
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puter’s  fetch- and- add,  and  are  actually  used  to  implement  semaphores,  futures  and  so 
on.  Pscheme’s  exclusive  seems  to  provide  a  higher  level  mechanism  to  express  shared 
objects  with  mutable  values. 


8.2  Qlisp 

Qlisp’s  design  goal  is  medium-grained  parallelism  [25,24,22].  It  applies  queue-based  mul¬ 
tiprocessing  to  support  a  variable  number  of  processors,  and  allows  users  to  check  the 
task  queue  to  control  excessive  parallelism. 

Computation  is  in  the  form  of  tree-structured  processes  with  AND/OR  parallelism 
relating  them.  The  OR  parallelism  is  accomplished  by  a  mechanism  called  heavyweight 
future ,  whose  value  is  computed  by  several  processes.  When  a  value  computed  by  a  process 
satisfies  an  end  test,  the  future  is  realized  (or  touched),  and  other  processes  associated 
with  the  future  are  killed. 

Qlisp  provides  primitives  for  explicit  process  synchronization,  process  killing,  and 
locking.  Common  Lisp’s  catch  and  throw  are  extended  so  that  programmers  can  use 
throw  to  kill  processes  3.  This  is  similar  to  passing  a  value  to  a  single-port  in  Pscheme, 
but  in  Pscheme,  no  process  is  killed  until  processes  themselves  commit  suicide  when 
passing  values  to  a  closed  single-port.  In  Qlisp,  the  kill-process  primitive  takes  a  future 
argument  and  kills  all  processes  associated  with  it.  Locks  are  available  at  the  language 
level  together  with  another  locking  construct  -  the  process  closure  qlambda,  which  is  the 
same  as  Pscheme’s  exclusive  function. 

A  special  feature  of  Qlisp  is  the  PMI  functions  ( partially ,  multiply  invoked  functions). 
The  idea  is  to  separate  the  coordination  of  the  arrivals  of  arguments  from  the  processing  of 

3 This  is  actually  the  invocation  of  process  continuations  except  that  they  cannot  be  upward  continua¬ 
tions,  and  only  serve  as  non-local  exits. 
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them.  Applying  such  a  function  can  be  in  various  forms  depending  on  which  parameter(s) 
is  (are)  submitted.  That  is,  a  function  can  be  partially  invoked  as  well  as  multiply  invoked. 
This  is  similar  to  passing  values  to  synchronous  ports  built  by  pcall  in  Pscheme.  With 
PMI  functions,  streams  can  be  implemented.  Qlisp’s  approach  differs  from  Pscheme  in 
that  data  elements  in  a  stream  are  submitted  to  PMI  functions  in  the  form  of  function 
application,  while  in  Pscheme,  explicit  throws  to  the  port  are  used. 

Many  constructs  are  introduced  in  Qlisp  to  express  different  concepts,  but  most  of 
them  can  be  explained  in  terms  of  ports.  In  Pscheme,  ports  have  a  more  general  meaning 
than  Qlisp’s  continuations  that  are  mainly  used  for  non-local  exits.  Qlisp’s  PMI  functions 
are  examples  of  Pscheme’s  synchronous  ports.  Qlisp’s  design  philosophy  tends  to  provide 
language  constructs  for  all  concepts  they  consider  important,  but  Pscheme  does  not  em¬ 
phasize  any  particular  aspect.  Instead,  it  allows  users  to  build  higher  level  facilities  for 
various  applications. 

8.3  Actors 

Actors  [2]  is  a  model  of  concurrent  computation,  which  integrates  functional  program¬ 
ming  and  object  based  programming  in  the  sense  that  computation  is  performed  through 
message  passing  among  concurrent  active  objects  ( actors )  whose  behaviors  are  functional 
and  can  be  changed  functionally. 

Actors  can  be  created  dynamically,  and  communication  between  them  is  asynchronous 
so  that  maximal  concurrency  can  be  exploited.  The  language  is  expressive  enough  to 
model  imperative  data  structures  (such  as  stacks),  recursive  functions  (such  as  factorial), 
and  the  nodes  in  a  dataflow  computing  environment  via  the  use  of  dynamically  created 
actors. 

Pscheme’s  approach  to  expressing  message  passing  processes  is  similar  to  the  Actors 
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model  in  the  sense  that  exclusive  function  can  also  be  created  dynamically,  and  ports  can 
be  used  to  store  incoming  messages.  Though  Scheme  takes  a  completely  different  approach 
from  Actors,  embedding  ports  in  Pscheme  makes  it  possible  to  apply  message  passing 
paradigms.  Order  preserving  channels  can  be  implemented  in  Actors  as  in  Pscheme, 
where  values  thrown  to  a  port  by  the  same  process  preserves  the  sending  order. 

Actors  define  an  orthogonal  active  construct  in  the  message-passing  paradigm,  and 
therefore  provide  a  fundamental  model  for  concurrency.  The  message  passing  model  is 
powerful  enough  to  explain  most  concepts  or  activities  in  parallel  computation.  How¬ 
ever,  since  message  passing  is  intrinsically  operational ,  it  becomes  a  contradiction  to 
the  declarative  programming  style.  Operational  reasoning  of  parallel  computation  also 
makes  programming  more  difficult.  Pscheme’s  approach  is  to  support  the  message  pass¬ 
ing  paradigm  in  a  declarative  context.  For  example,  there  is  no  need  to  model  recursion 
through  message  passing  activations.  4  Provided  with  ports,  Pscheme  programmers  are 
able  to  express  middle  grained  message  passing  processes,  and  at  the  same  time,  keep  the 
program  declarative. 

8.4  Lucid 

Lucid  is  a  data-flow  language  attempting  to  evolve  from  functional  languages  that  are 
either  Lisp-like  or  have  single- assignments  (such  as  ID  [7]).  It  abstracts  iteration  and 
recursion  into  a  higher  level  form  of  data  streams.  Data  streams  are  viewed  as  lazy 
lists  with  element  retrieving  operations  provided.  Computation  can  be  specified  through 
recursive  equations  among  stream  identifiers  while  having  an  iterational  semantics.  It  is 
a  declarative  language,  but  programmers  are  encouraged  to  think  operationally  in  terms 

4  In  the  Actors  model,  each  activation  of  a  recursive  function  becomes  a  created  actor  which  passes 
parameters  in  the  forms  of  messages. 
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of  dataflow. 


Lucid  is  designed  to  completely  depart  from  the  Von  Neumann  model  of  computation, 
and  expects  to  suit  well  on  data-flow  architectures.  In  fact,  Lucid  is  a  high  level  language 
but  not  a  parallel  language  because  programmers  have  no  control  of  parallelism,  which  is 
contrary  to  Pscheme’s  design  goal. 

8.5  Other  Models  of  Parallel  Continuations 

Hieb  and  Dybvig’s  process  continuation  [33]  provides  fine  control  of  parallel  processes  in 
an  environment  where  there  is  a  tree  structure  relationship  among  them.  That  means, 
each  process  has  a  parent,  who  spawns  it.  A  process  continuation  is  a  partial  (or  local ) 
continuation  representing  the  control  state  captured  back  to  the  spawn  point.  It  is  actually 
the  parallel  counterpart  of  Felleisen’s  T- continuation.  [15]  Such  partial  continuations  are 
functional  in  the  sense  that  invoking  them  is  like  calling  a  function  representing  the 
captured  computation.  Capturing  them  suspends  all  tasks  from  below  the  spawn  point 
to  the  capture  point.  Process  continuations  are  useful  to  control  the  parallel  search, 
implement  parallel  OR  or  kill  other  tasks.  On  other  hand,  the  degree  of  parallelism  is 
quite  limited,  and  they  only  apply  to  tree  structured  parallelism. 

PaiLisp  [38,39]  is  another  parallel  extension  of  Lisp.  It  distinguishes  single-use  and 
multiple-use  continuations.  PaiLisp  handles  continuations  according  to  the  identity  of  the 
process  that  captures  it  (processes  can  be  spawned  by  any  expression.)  When  invoking 
a  continuation,  if  the  invoking  process  is  the  capturing  process,  sequential  semantics  ap¬ 
plies  (i.e.  it  is  a  goto).  If  not,  the  invoking  process  goes  on,  and  the  capturing  process 
abandons  its  current  computation  (if  any)  and  unconditionally  jumps  to  the  control  point 
the  continuation  represents.  PaiLisp  may  not  have  such  precise  control  as  process  contin¬ 
uations,  but  a  PaiLisp  process  has  control  of  other  processes  in  addition  to  those  in  the 
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same  subtree. 


Both  versions  of  parallel  continuations,  together  with  the  suggested  continuation  model 
in  Multilisp,  have  one  point  in  common.  None  of  them  allows  the  invocation  of  a  contin¬ 
uation  to  fork  a  new  thread  within  the  invoking  process  so  that  activations  are  actually 
disjoint,  and  thus  no  race  condition  stated  above  will  occur.  Pscheme  has  no  such  limits, 
but  it  provides  a  die  to  control  the  interference  among  activations. 

8.6  Pscheme’s  Approach  to  Parallelism 

It  is  mentioned  that  we  consider  ports  to  be  the  natural  parallel  extension  of  continu¬ 
ations.  When  designing  Pscheme,  however,  we  have  no  intention  to  define  ports  so  as 
to  be  compatible  with  continuations  or  programs  with  heavy  use  of  continuations.  This 
differs  from  previous  attempts  for  parallel  continuations,  and  we  can  thus  come  up  with 
an  orthogonal  construct  that  is  well  defined  across  processes,  and  has  its  independent 
meaning. 

Pscheme  views  parallel  computation  as  a  dynamic  dataflow  diagram.  Processes  do 
not  have  to  have  a  tree-structure  relationship.  With  the  ports  available,  a  process  is 
able  to  communicate  to  one  that  would  have  been  invisible  in  the  process  tree.  Moreover, 
passing  an  object  to  a  port  has  nothing  to  do  with  transferring  control.  That  is  why  throw 
does  not  terminate  the  current  thread  (though  it  is  a  little  bit  misleading  to  continuation 
programmers). 

Since  no  tree-structure  relationship  is  among  processes,  Pscheme  provides  no  process 
killing  primitive  except  that  suicides  can  be  committed  through  die.  Of  course,  exclusive 
functions  are  needed  to  enforce  order  and  atomicity  in  the  computation  described  by  the 
dynamic  dataflow  diagram. 

We  think  that  this  aspect  (dynamic  dataflow  diagram)  of  parallel  computation  is 
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more  general,  and  makes  programmers  more  capable  of  modeling  real  world  concurrent 
activities.  On  the  other  hand,  Pscheme  programmers  are  not  suggested  to  commit  to 
a  particular  programming  paradigm.  Concepts  of  ports  have  made  programming  much 
more  flexible. 

8.7  Comparison  in  Terms  of  Expressiveness 

The  design  of  Pscheme  is  actually  an  integration  of  the  related  work  stated  above.  We 
start  from  a  parallel  Lisp  supporting  horizontal  parallelism  and  exclusive  functions,  and 
then  extend  the  sequential  control  mechanism,  continuations,  into  the  parallel  world. 
Therefore,  the  proposal  of  ports  makes  possible  parallelism  in  the  message  passing  and 
the  dataflow  style. 

In  the  example  of  stream  based  quicksort,  processes  can  be  created  and  passed  around 
dynamically,  which  demonstrates  the  expressiveness  as  in  the  message  passing  Actor 
model.  A  partition  cell  can  function  in  parallel  with  its  producer  and  two  consumers, 
which  illustrates  not  only  the  vertical  parallelism  Pscheme  can  exploit,  but  the  single- 
producer-multi-consumer  relationship  that  many  dataflow  or  non-strict/lazy  languages 
have  problems  to  express.  Pscheme  is  more  expressive  in  the  sense  that  control  flow  can 
be  relieved  from  the  call-return  structure,  which  expression  languages  are  based  on.  Also, 
ports  are  more  expressive  than  parallel  continuations  in  the  sense  that  they  are  clearly 
defined  across  thread  boundaries. 

Aside  from  those  models  motivating  the  design  of  Pscheme,  there  are  many  other 
parallel  languages  pursuing  various  programming  paradigms.  For  example,  the  concurrent 
logic  programming  paradigm  is  not  among  those  Pscheme  attempts  to  model.  In  the 
family  of  concurrent  logic  languages,  processes  are  used  to  solve  conjunctive  goals,  and 
shared  logical  variables  serve  as  communication  channels.  The  implicit  communication 
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process 
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model 
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parallelism 

Ada 

imperative 
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sync 

msg  passing 
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object  based 
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async 

msg  passing 

Futures 

applicative 

thread  5 

implicit 

speculative 

Linda 

— 

implicit 

shared  memory 
/  anonymous 

shared 

memory 

Concurrent 

Prologs 

logic 

implicit 

shared 

variable 

AND/OR 

/stream 

Dataflow 

languages 

declarative 

implicit 

implicit 

dataflow 

P  scheme 

applicative 

ex-function 
/  thread 

sync/async 

ports 

various 

Table  8.1:  Paradigms  for  various  parallel  languages 


among  processes  through  these  shared  variables  can  generates  sufficient  parallelism  to 
model  stream  based  programming,  (see  [60])  For  example,  the  stream  based  quicksort 
can  be  written  in  PARLOG  declaratively.  [28] 

Linda  [23,10]  is  a  coordination  language  that  provides  a  shared  space  (called  tuple 
space )  for  tasks  to  communicate.  Synchronization  and  dataflow  relationship  among  tasks 
is  enforced  by  tuple  space  operations  implicitly.  Linda  puts  little  restriction  on  program¬ 
ming  styles  (,  which  actually  depend  on  mother  languages  such  as  C  that  take  care  of 
the  computation  part,)  because  it  only  cares  for  the  coordination  part  in  parallel  compu¬ 
tation.  Although  a  universal  shared  memory  is  expressive  enough  to  model  all  kinds  of 
process  interaction,  its  aspect  of  parallelism  is  far  from  the  message  passing  model  in  the 
sense  that  it  ignores  the  locality  of  communication. 

Table  8.1  illustrates  the  coordination  paradigm  in  various  parallel  programming  lan¬ 
guages  or  models.  We  will  not  make  comparisons  between  the  applicative  programming 


5  A  thread  is  a  sequential  expression  in  evaluation. 
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paradigm  and  those  adopted  by  other  languages,  but  we  would  like  to  compare  Pscheme’s 
approach  to  parallelism  with  those  of  other  models. 

We  view  ports  as  higher  level  abstractions  that  contain  information  about  queuing, 
synchronization,  and  subsequent  computation.  On  the  other  hand,  the  implementation 
of  ports  makes  them  inexpensive  (and  thus  low-level)  enough  to  build  up  higher  level 
constructs  such  as  futures,  rendezvous  objects  or  asynchronous  channels.  Pscheme  pro¬ 
grammers  are  suggested  to  program  in  a  correct  style  in  the  sense  that  ports  should  be 
used  to  model  what  is  conceptually  similar  or  close  to  them. 

Speaking  of  the  process  model,  ports  can  be  as  expressive  as  actors  (first  class  pro¬ 
cesses).  In  terms  of  the  communication  model,  ports  serve  as  message  queues,  and  can 
express  synchronous  or  asynchronous  communication,  either  by  naming  the  receiver  or 
anonymously.  6  As  for  the  aspect  of  parallelism,  ports  can  express  all  in  the  table  but 
the  AND/OR  parallelism  in  most  concurrent  logic  programming  languages.  The  reason 
is  that  there  is  no  tree  structure  relationship  among  Pscheme  threads,  and  it  is  hard  to 
define  the  parent-child  relation  of  processes.  Therefore,  there  is  no  way  to  kill  processes 
implicitly  as  soon  as  their  computation  becomes  redundant.  The  shared  memory  model 
is  easy  to  express  because  Pscheme  is  a  language  with  global  name  space,  and  we  have 
mechanisms  protecting  shared  variables. 

To  sum  up,  the  concept  of  ports  is  not  far  from  major  coordination  mechanisms  in 
various  parallel  computation  models.  Pscheme  provides  optional  additional  expressiveness 
in  a  declarative  programming  context. 


6  Anonymous  communication  can  be  done  by  donating  a  channel  process  with  an  entry  port  for  incoming 
messages  and  an  exit  port  for  read  requests. 
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Chapter  9 

Future  Work  and  Conclusion 

9.1  Future  Work 

Most  of  the  work  to  be  done  in  the  future  is  regarding  the  Pscheme  implementation. 
There  will  be  improvement  in  Pscheme  compiler  as  well  as  the  run  time  system.  System 
interfaces  will  also  be  implemented  to  facilitate  programming. 

9.1.1  Compiler  Optimization 

Uniform  closure  conversion  may  generate  expensive  code  in  that  not  every  function  needs 
to  be  converted  into  closure-passing  style,  and  not  every  function  application  needs  to  first 
select  held  zero  from  a  closure  record  to  get  the  code  to  apply.  The  compiler  can  perform 
some  dataflow  analysis  to  decide  whether  to  convert  a  closed  function  into  closure-passing 
style  so  that  some  run  time  closure  operations  can  be  saved. 

The  dispatching  loop  is  expensive  because  a  C  procedure  call  happens  each  time  a 
compiled  function  is  dispatched.  If  the  compiler  can  generate  code  that  transfer  control  to 
a  known  function  through  the  label  (i.e.  via  a  goto)  without  going  back  to  the  dispatcher, 
many  procedure  calls  can  be  saved.  1  This  technique  is  used  in  the  SML-to-C  translator 
*In  a  CPS  transformed,  closure  converted  program,  instruction  execution  is  all  through  the  procedure 


by  Darditi,  Lee  and  Acharya  [61].  The  dispatcher  is  still  needed  for  functions  that  are 
not  known  by  the  compiler. 

As  is  mentioned  in  chapter  7,  to  get  better  sequential  performance,  assembly  code  can 
be  generated  for  various  platforms,  and  machine  registers  will  be  allocated  directly. 

9.1.2  Run  Time  System  Improvement 

We  would  also  like  to  have  a  more  portable  run  time  system  that  does  not  rely  on  the 
support  of  fetch- and- add  or  parallel  queues.  Queues  that  serialize  requests  can  be  used  to 
implement  ports  because  the  contention  among  different  threads  is  not  high,  but  the  global 
task  queue  had  better  be  implemented  without  serialization,  or  can  even  be  replaced  by 
hierarchical  queues  that  decrease  contention.  For  instance,  in  Multilisp,  each  process  has 
a  local  task  queue,  and  there  is  a  protocol  for  processes  to  retrieve  non-local  tasks  so  as 
to  balance  the  load. 

Dynamic  control  of  excessive  parallelism  may  also  be  implemented  in  the  run  time 
system  with  an  alternative  task  scheduling  policy  such  as  in  Multilisp. 

9.1.3  I/O  and  System  Interface 

It  is  desired  to  provide  I/O  functions  in  Pscheme  as  well  as  other  system  interface  primi¬ 
tives  such  as  dynamic  loading  facility,  foreign  language  interface,  timing  instructions  and 
so  on. 


9.2  Conclusion 

This  research  is  an  attempt  to  reason  about  control  in  parallel  computation.  Through  our 
experiments  in  the  parallel  applicative  language  Pscheme,  we  found  that  various  control 
calls  on  continuations.  A  simple  control  transfer  implemented  by  a  calling  sequence  of  several  instructions 
results  in  a  large  slowdown  factor,  so  it  should  be  avoided  as  much  as  possible. 
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concepts  such  as  synchronization,  blocking  and  ordering  can  be  explained  in  terms  of 
ports,  which  we  claim  to  be  a  general  control  mechanism. 

Modeling  computation  as  a  dynamic  dataflow  net  with  the  tangible  links  enables  us 
to  express  different  aspects  of  parallel  computation,  which  makes  it  possible  to  apply 
different  programming  paradigms  in  one  language.  Because  of  the  expressive  power,  we 
conclude  that  the  port  plays  the  same  role  in  parallel  computation  as  the  continuation 
does  in  the  sequential  world. 

Pscheme  indicates  how  expressive  a  parallel  language  can  be.  At  the  same  time,  we 
claim  that  exposing  control  details  in  parallel  computation  through  ports  makes  the  pro¬ 
grams  no  harder  to  understand  than  sequential  programs  with  heavy  use  of  continuations. 
In  other  words,  it  shows  the  extent  of  difficulty  for  a  programmer  to  reason  about  or  model 
parallel  computation. 

Is  programming  in  Pscheme  difficult?  No,  if  higher  level  control  libraries  are  provided. 
As  a  matter  of  fact,  it  makes  it  a  lot  easier  to  model  other  aspects  of  parallel  computation 
(such  as  pipelining  or  message  passing)  for  an  applicative  language  programmer.  More 
expressive  power  without  sacrificing  the  advantage  of  the  original  declarative  style  never 
makes  programming  more  difficult. 

How  efficient  are  Pscheme  programs?  We  have  seen  good  speedups  for  parallel  pro¬ 
grams.  Dataflow  programs,  however,  have  a  less  satisfactory  absolute  performance.  Also, 
programs  with  fine-grained  parallelism  are  not  efficient  enough.  Observing  the  perfor¬ 
mance  of  sequential  Pscheme  programs  and  the  scalability  of  parallel  programs,  we  think 
that,  with  architectural  support,  Pscheme  can  have  performance  as  good  as  other  dataflow 
languages  or  languages  supporting  fine-grained  parallelism. 

Our  final  conclusion  is:  representing  general  control  in  parallel  computation  is  feasible. 
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Appendix  A 

Transforming  The  Parallel  Sum 
Program 

A.l  Input  Program 

(define  (sum  m  n) 

(cond  ((eq?  m  n)  m) 

((eq?  n  (+  1  m))  (+  m  n)) 

(else  (letrec  ((x  (/  (+  m  n)  2))) 

(pcall  +  (sum  m  x)  (sum  (+  1  x)  n)))))) 

(sum  1  250) 

A. 2  Continuation  Passing  Style:  (hoisted) 

LETREC 


f 23  =  LAMBDA  v23  v24  f24 


PRIM  +  v23  v24  v25 

APPLY  f 24  v25 
SUM  =  LAMBDA  M  I  fl 
LETREC 

f 3  =  LAMBDA  v2 

APPLY  fl  v2 
PRIM  EQ?  MI  v5 
IF  v5 

THEM  APPLY  f3  M 
ELSE  LETREC 

f 7  =  LAMBDA  v6 

APPLY  f 3  v6 
PRIM  +  1  M  vll 
PRIM  EQ?  I  vll  vl2 
IF  vl2 

THEM  PRIM  +  M  I  vl5 
APPLY  f 7  vl5 
ELSE  LETREC 

X  =  PRIM  +  M  I  vl9 
PRIM  /  vl9  2  v21 

BUD  X  v21 
si  =  SYNC  f 23  2  f 7 
pi  =  BUILD-PORT  si  1 
p2  =  BUILD-PORT  si  2 
PARA 
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LETREC 


f 28  =  LAMBDA  v29 

PASS  v29  pi 
APPLY  SUM  M  X  f 28 
LETREC 

f 34  =  LAMBDA  v35 

PASS  v35  p2 
PRIM  +  1  X  v33 
APPLY  SUM  v33  I  f34 

f 37  =  LAMBDA  v39 

END  v39 
APPLY  SUM  1  250  f37 

A. 3  Closure  Passing  Style:  (hoisted) 

LETREC 

f 23  ’  =  LAMBDA  f23”  v23  v24  f24 
LETREC 

f 24 ’  =  SELECT  f24  0 
PRIM  +  v23  v24  v25 

APPLY  f 24 ’  f 24  v25 
f 3 ’  =  LAMBDA  f3”  v2 
LETREC 

f 1’  =  SELECT  f 3 ’ ’  1 
f  1”  =  SELECT  f  1’  0 
APPLY  f 1’ ’  f 1’  v2 
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f 7 ’  =  LAMBDA  f7”  v6 


LETREC 

f 3 ’ ’ ’ ’  =  SELECT  f 7 ’ ’  1 
f 3 ’ ’ ’ ’ ’  =  SELECT  f 3 ’ ’  ’  ’  0 
APPLY  f 3 ’ ’ ’ ’ ’  f 3 ’ ’ ’ ’  v6 
f 28  ’  =  LAMBDA  f28”  v29 
LETREC 

pi’  =  SELECT  f 28 ’ ’  1 
PASS  v29  pi’ 
f 34  ’  =  LAMBDA  f34”  v35 
LETREC 

p2  ’  =  SELECT  f34”  1 
PASS  v35  p2 ’ 

SUM’  =  LAMBDA  SUM”  M  I  fl 
LETREC 

f 23 ’ ’ ’  =  OFFSET  SUM’ ’  -1 
f 3  =  CLOSURE  f 3 ’  fl 
PRIM  EQ?  MI  v5 
IF  v5 

THEM  LETREC 

f3’  ”  =  SELECT  f 3  0 
APPLY  f3’  ”  f 3  M 
ELSE  LETREC 

f 7  =  CLOSURE  f 7 ’  f3 
PRIM  +  1  M  vll 
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PRIM  EQ?  I  vll  v!2 


IF  vl2 
THEN  LETREC 

f  7  ”  ’  =  SELECT  f 7  0 
PRIM  +  M  I  vl5 
APPLY  f  7  ”  ’  f 7  vl5 
ELSE  LETREC 

X  =  PRIM  +  M  I  vl9 
PRIM  /  vl9  2  v21 

BIND  X  v21 

si  =  SYNC  f 23 ’ ’ ’  2  f 7 
pi  =  BUILD-PORT  si  1 
p2  =  BUILD-PORT  si  2 
PARA 
LETREC 

f 28  =  CLOSURE  f28’  pi 
SUM  ”  ’  =  SELECT  SUM  ”  0 
APPLY  SUM’”  SUM”  MX  f28 
LETREC 

f 34  =  CLOSURE  f34’  p2 
SUM  ”  ”  =  SELECT  SUM  ”  0 
PRIM  +  1  X  v33 
APPLY  SUM””  SUM”  v33  I  f34 

f 37  ’  =  LAMBDA  f37”  v39 
END  v39 
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f 23  =  CLOSURE  f23’  SUM’  f37’ 

SUM  =  OFFSET  f23  1 
f 37  =  OFFSET  f23  2 
SUM  ”  ”  ’  =  SELECT  SUM  0 
APPLY  SUM’ ’ ’ ’ ’  SUM  1  250  f37 

A. 4  Thread  Abstraction 

LETREC 

f 23  ’  =  LAMBDA  f23”  v23  v24  f24 
LETREC 

f 24 ’  =  SELECT  f24  0 
PRIM  +  v23  v24  v25 

APPLY  f 24 ’  f 24  v25 
f 3 ’  =  LAMBDA  f3”  v2 
LETREC 

f 1’  =  SELECT  f 3 ’ ’  1 
f  1”  =  SELECT  f  1’  0 
APPLY  f 1’ ’  f 1’  v2 
f 7 ’  =  LAMBDA  f7”  v6 
LETREC 

f 3 ’ ’ ’ ’  =  SELECT  f 7 ’ ’  1 
f 3 ’ ’ ’ ’ ’  =  SELECT  f 3 ’ ’ ’ ’  0 
APPLY  f 3 ’ ’ ’ ’ ’  f 3 ’ ’ ’ ’  v6 
f 28  ’  =  LAMBDA  f28”  v29 
LETREC 
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pi’  =  SELECT  f 28 ’ ’  1 
PASS  v29  pi’ 
f 34  ’  =  LAMBDA  f34”  v35 
LETREC 

p2  ’  =  SELECT  f34”  1 
PASS  v35  p2 ’ 

f 38  =  LAMBDA  SUM >>>>>>  pi”  f28”’  X’  M’ 
LETREC 

f 28  =  CLOSURE  f28”  ’  pi” 

SUM  ”  ’  =  SELECT  SUM  ”  ”  ”  0 
APPLY  SUM’”  SUM’  >  >  >  >  >  M’  X’  f28 
f 39  =  LAMBDA  SUM >>>>>>>  p2”  f34”  ’  X”  N’ 
LETREC 

f 34  =  CLOSURE  f34’  ”  p2” 

SUM  ”  ”  =  SELECT  SUM  >>>>>>>  0 
PRIM  +  1  X”  v33 
APPLY  SUM””  SUM’”””  v33  N’  f34 
SUM’  =  LAMBDA  SUM”  M  I  fl 
LETREC 

f 23 ’ ’ ’  =  OFFSET  SUM’ ’  -1 
f 3  =  CLOSURE  f 3 ’  fl 
PRIM  EQ?  MI  v5 
IF  v5 

THEM  LETREC 

f 3  ”  ’  =  SELECT  f 3  0 
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APPLY  f3”  ’ 


f  3  M 


ELSE  LETREC 

f 7  =  CLOSURE  f 7 ’  f3 
PRIM  +  1  M  vll 
PRIM  EQ?  I  vll  vl2 
IF  vl2 
THEN  LETREC 

f  7  ”  ’  =  SELECT  f 7  0 
PRIM  +  M  I  vl5 
APPLY  f  7  ”  ’  f 7  vl5 
ELSE  LETREC 

X  =  PRIM  +  M  I  vl9 
PRIM  /  vl9  2  v21 

BIND  X  v21 

si  =  SYNC  f 23 ’ ’ ’  2  f 7 
pi  =  BUILD-PORT  si  1 
p2  =  BUILD-PORT  si  2 
PARA 

APPLY  f 38  SUM”  pi  f28’  X  M 
APPLY  f 39  SUM”  p2  f34’  XI 

f 37  ’  =  LAMBDA  f37”  v39 
EID  v39 

f 23  =  CLOSURE  f23’  SUM’  f37’ 

SUM  =  OFFSET  f23  1 
f 37  =  OFFSET  f23  2 
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SUM  ”  ”  ’  =  SELECT  SUM  0 
APPLY  SUM’ ’ ’ ’ ’  SUM  1  250  f37 
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Appendix  B 


Internal  Representation  of 
Pscheme  Objects 


typedef  unsigned  int  heap_ptr;  /*  index  of  the  heap  array  */ 

typedef  int  (*function_ptr) () ; 

struct  b_i_l_s  /*  bool_int_label  structure  */ 

{ 

heap_ptr  forwarding; 
int  val; 

>; 


struct  char_s 

{ 

heap_ptr  forwarding;  /*  forwarding  is  for  GC  */ 


>; 


char  val; 


struct  real_s 


{ 

heap_ptr  forwarding; 
float  val; 

>; 


struct  string 

{ 

heap_ptr  forwarding; 
short  length; 
char  *val; 

>; 


struct  lambda 

{ 

heap_ptr  forwarding; 
unsigned  int  is_prim 
int  hv_no; 
function_ptr  code; 
bw_sem_t  *sem; 


struct  cons_cell 


{ 

heap_ptr  car; 
heap_ptr  cdr; 

>; 


struct  cell 

{ 

heap_ptr  point_to; 

>; 


struct  closure 

{ 

heap_ptr  fields; 
short  field_no; 
short  offset; 

>; 


struct  vector 

{ 

heap_ptr  val; 
int  length; 

>; 


/*  forwarding  */ 


/*  forwarding  */ 


/*  forwarding  */ 


/*  non-negative  */ 


/*  forwarding  */ 
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struct  synchronizer 

{ 


unsigned  int  arity:4; 
uns igned  int  count : 4 ; 

heap_ptr  destination;  /*  forwarding  */ 
heap_ptr  cont ; 
bw_rwlock_t  *lock; 

queue_t  **oqs;  /*  pointer  to  shared  bytes  */ 

short  *qlens;  /*  pointer  to  shared  bytes  */ 

>; 

struct  async_port 

{ 

int  single_closed; 
heap_ptr  destination; 
queue_t  *queue; 

>; 

struct  sync_port 

{ 

short  index; 

heap_ptr  sync;  /*  forwarding  */ 

>; 


/*  2-byte  single  and  2-byte  closed  */ 
/*  forwarding  */ 
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union  u_obj 

{ 

struct  b_i_l_s  label; 
struct  b_i_l_s  int_num; 
struct  b_i_l_s  bool; 
struct  char_s  ch; 
struct  real_s  real_num; 
struct  cell  pointer; 
struct  cons_cell  pair; 
struct  lambda  *lbd; 
struct  closure  clsr; 

struct  string  str;  /*  also  used  by  quoted  symbol  */ 

struct  vector  vet; 

struct  sync_port  s_port; 

struct  async_port  a_port ; 

struct  synchronizer  *sync; 

>; 


typedef  struct 

{ 

unsigned  int  type:  4; 
union  u_obj  content; 
>  obj ; 
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